Commit Graph

1027 Commits

Author SHA1 Message Date
Maysam Yabandeh
361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Reid Horuff
1ca5f6d132 Fix 2PC Recovery SeqId Miscount
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486

Differential Revision: D4156064

Pulled By: reidHoruff

fbshipit-source-id: a6da8d9
2016-11-10 11:09:22 -08:00
Andrew Kryczka
c90fef88b1 fix open failure with empty wal
Summary: Closes https://github.com/facebook/rocksdb/pull/1490

Differential Revision: D4158821

Pulled By: IslamAbdelRahman

fbshipit-source-id: 59b73f4
2016-11-09 22:24:26 -08:00
Reid Horuff
d133b08f68 Use correct sequence number when creating memtable
Summary:
copied from: 5ebfd2623a

Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.

This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2.  Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"

This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458

Differential Revision: D4143791

Pulled By: reidHoruff

fbshipit-source-id: 5a25033
2016-11-09 12:24:17 -08:00
Islam AbdelRahman
9bd191d2f4 Fix deadlock between (WriterThread/Compaction/IngestExternalFile)
Summary:
A deadlock is possible if this happen

(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish

This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.

This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480

Differential Revision: D4151646

Pulled By: IslamAbdelRahman

fbshipit-source-id: 09b39db
2016-11-09 10:54:10 -08:00
Andrew Kryczka
9e7cf3469b DeleteRange user iterator support
Summary:
Note: reviewed in  https://reviews.facebook.net/D65115

- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464

Differential Revision: D4131753

Pulled By: ajkr

fbshipit-source-id: be86559
2016-11-04 12:09:22 -07:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Andrew Kryczka
40a2e406f8 DeleteRange flush support
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438

Differential Revision: D4100025

Pulled By: ajkr

fbshipit-source-id: cb01a70
2016-10-31 20:54:18 -07:00
Siying Dong
da61f348d3 Print compression and Fast CRC support info as Header level
Summary:
Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level.

Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448

Differential Revision: D4106608

Pulled By: yiwu-arbug

fbshipit-source-id: cb9a076
2016-10-31 16:09:13 -07:00
Siying Dong
c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Siying Dong
04751d5345 L0 compression should follow options.compression_per_level if not empty
Summary:
Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior
 doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0].

Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435

Differential Revision: D4099368

Pulled By: siying

fbshipit-source-id: cfbbdcd
2016-10-28 17:39:20 -07:00
Aaron Gao
9de2f75216 revert Prev() in MergingIterator to use previous code in non-prefix-seek mode
Summary: Siying suggested to keep old code for normal mode prev() for safety

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65439
2016-10-24 13:13:01 -07:00
Aaron Gao
59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman
869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Andrew Kryczka
f47054015d Handle WAL deletion when using avoid_flush_during_recovery
Summary:
Previously the WAL files that were avoided during recovery would never
be considered for deletion. That was because alive_log_files_ was only
populated when log files are created. This diff further populates
alive_log_files_ with existing log files that aren't flushed during recovery,
such that FindObsoleteFiles() can find them later.

Depends on D64053.

Test Plan: new unit test, verifies it fails before this change and passes after

Reviewers: sdong, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64059
2016-10-14 12:59:51 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Islam AbdelRahman
5691a1d8a4 Fix compaction conflict with running compaction
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency

Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D64947
2016-10-13 10:49:06 -07:00
Andrew Kryczka
017de666c7 fixup commit
Summary: I accidentally left out these changes from my commit of D64053 due to
messing up the merge conflict resolution.

Test Plan: ./db_wal_test

Reviewers:

Subscribers:

Tasks:

Blame Revision: D64053
2016-10-13 08:48:40 -07:00
Andrew Kryczka
1b7af5fb1a Redo handling of recycled logs in full purge
Summary:
This reverts commit 9e4aa798c3,
which doesn't handle all cases (see inline comment).

I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313.

This approach has two benefits:

- All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles.
- We only need to check whether log file is recycled in one place where we've already determined it's a log file

Test Plan:
new unit test, verified fails before the original fix, still passes
now.

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: yiwu, sdong

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64053
2016-10-12 23:13:09 -07:00
Aaron Gao
447f17127c new Prev() prefix support using SeekForPrev()
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.

Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.

2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.

add test cases for the above two cases.

There are some tests for the SeekToLast() in Prev(), I will clean them later.

Test Plan: make all check

Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63933
2016-10-11 13:54:26 -07:00
Reid Horuff
2c1f95291d Add facility to write only a portion of WriteBatch to WAL
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.

```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628

RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```

Test Plan: Two unit tests added.

Reviewers: sdong, yiwu, IslamAbdelRahman

Reviewed By: yiwu

Subscribers: hermanlee4, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64599
2016-10-07 11:32:10 -07:00
Islam AbdelRahman
87dfc1d23e Fix conflict between AddFile() and CompactRange()
Summary:
Fix the conflict bug between AddFile() and CompactRange() by
- Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction
- If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it

This will solve these 2 scenarios
- If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction
- If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64449
2016-09-28 15:42:06 -07:00
yiwu-arbug
eb3894cf42 Recompute compaction score on SetOptions (#1346)
Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it.

Test Plan: See unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64167
2016-09-27 11:17:15 -07:00
Islam AbdelRahman
5c64fb67d2 Fix AddFile() conflict with compaction output [WaitForAddFile()]
Summary:
Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file

this diff add
- WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now
- Call `WaitForAddFile()` in 3 locations
-- When doing manual Compaction
-- When starting automatic Compaction
-- When  doing CompactFiles()

Test Plan: unit test

Reviewers: lightmark, yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64383
2016-09-27 00:14:55 -07:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
yiwu-arbug
4bc8c88e6b Recover same sequence id from WAL (#1350)
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.

This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.

Test Plan: Added unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64275
2016-09-23 16:15:14 -07:00
Aaron Gao
715256338a forbid merge during recovery
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625

Test Plan: make all check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62655
2016-09-21 11:05:07 -07:00
Yi Wu
e4d3f5d9b8 Fix DBImpl::GetWalPreallocateBlockSize Mac build error
Summary: Specify type param with std::min to resolve compile error on Mac.

Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64143
2016-09-20 10:17:28 -07:00
sdong
d78a4401b5 DBImpl::GetWalPreallocateBlockSize() should return size_t
Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t.

Test Plan: Run existing tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64137
2016-09-19 16:51:38 -07:00
sdong
b666f85445 Consider more factors when determining preallocation size of WAL files
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.

Test Plan: Add a unit test.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63957
2016-09-19 12:04:35 -07:00
Yi Wu
0a88f38b7e Remove ColumnFamilyData::options()
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64047
2016-09-16 15:09:14 -07:00
Andrew Kryczka
06b4785fec Fix recovery for WALs without data for all CFs
Summary:
if one or more CFs had no data in the WAL, the log number that's used
by FindObsoleteFiles() wasn't updated. We need to treat this case the same as
if the data for that WAL had been flushed.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63963
2016-09-15 11:40:48 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu
81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
somnathr
9e4aa798c3 Summary: (#1313)
If log recycling is enabled with the rocksdb (recycle_log_file_num=16)
 db->Writebatch is erroring out with keynotfound after ~5-6 hours of run
 (1M seq but can happen to any workload I guess).See my detailed bug
 report here (https://github.com/facebook/rocksdb/issues/1303).
 This commit is the fix for this, a check is been added not to delete
 the log file if it is already there in the recycle list.

Test Plan:
 Unit tested it and ran the similar profile. Not reproducing anymore.
2016-09-12 16:53:42 -07:00
Islam AbdelRahman
1cca091298 Temporarily revert Prev() prefix support
Summary:
Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release

These are the commits reverted

  - 6a14d55bd9
  - b18f9c9eac
  - db74b1a219
  - 2482d5fb45

Test Plan: make check -j64

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63789
2016-09-08 14:45:32 -07:00
Injun Song
ce1be2ce37 Fix build error on Windows (AppVeyor) (#1315)
Add 'cf_options' to source list and db_imple.cc

fix casting
2016-09-06 08:41:43 -07:00
Aaron Gao
2482d5fb45 support Prev() in prefix seek mode
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode

Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61419
2016-08-29 20:55:39 -07:00
sdong
dade61ac26 Mitigate regression bug of options.max_successive_merges hit during DB Recovery
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.

This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.

Test Plan: Add a new test and run it.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62625
2016-08-25 17:30:34 -07:00
Justin Gibbs
b2ce59537c Persist data during user initiated shutdown
Summary:
Move the manual memtable flush for databases containing data that has
bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork().

CancelAllBackgroundWork() is a publicly exposed API which allows
async operations performed by background threads to be disabled on a
database. In effect, this places the database into a "shutdown" state
in advance of calling the database object's destructor. No compactions
or flushing of SST files can occur once a call to this API completes.

When writes are issued to a database with WriteOptions::disableWAL
set to true, DBImpl::has_unpersisted_data_ is set so that
memtables can be flushed when the database object is destroyed. If
CancelAllBackgroundWork() has been called prior to DBImpl's destructor,
this flush operation is not possible and is skipped, causing unnecessary
loss of data.

Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor
in order to perform the thread join portion of its cleanup processing,
moving the manual memtable flush to CancelAllBackgroundWork() ensures
data is persisted regardless of client behavior.

Test Plan:
Write an amount of data that will not cause a memtable flush to a rocksdb
database with all writes marked with WriteOptions::disableWAL. Properly
"close" the database. Reopen database and verify that the data was
persisted.

Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62277
2016-08-25 12:24:22 -07:00
sdong
56dd034115 read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.

Test Plan: I am adding unit tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61503
2016-08-10 13:16:41 -07:00
Yi Wu
ee027fc19f Ignore write stall triggers when auto-compaction is disabled
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.

Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.

Test Plan: See the new test. Removed two tests that are no longer valid.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61437
2016-08-02 21:55:26 -07:00
sdong
2a6d0cde72 Ignore stale logs while restarting DBs
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.

This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.

In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.

Test Plan: Add two unit tests that fail without the fix.

Reviewers: IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60891
2016-07-25 11:47:31 -07:00
sdong
d5a51d4de3 Need to make sure log file synced before flushing memtable of one column family
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.

Test Plan: Add a new unit test which will fail without the diff.

Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60915
2016-07-21 16:29:06 -07:00
Aaron Gao
dda6c72ac8 Add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to db.h
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60765
2016-07-13 17:59:25 -07:00
Yi Wu
6ea41f8527 Fix deadlock when trying update options when write stalls
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.

Test Plan:
Run unit tests.

Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60627
2016-07-12 15:30:38 -07:00
Aaron Gao
8e6b38d895 update DB::AddFile to ingest list of sst files
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).

Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58587
2016-07-11 10:43:12 -07:00
sdong
a00bf1b3cf Add More Logging to track total_log_size
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.

Test Plan: Run the unit test and see the logging.

Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60081
2016-07-06 14:29:18 -07:00
sdong
32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
omegaga
a45ee83181 Fix a bug that accesses invalid address in iterator cleanup function
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.

Test Plan: Add a unit test to reproduce this bug.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60087
2016-07-05 11:57:14 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
Islam AbdelRahman
fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
Islam AbdelRahman
30a24f2d3d Add InternalStats and logging for AddFile()
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()

Test Plan: Make sure the code compile and existing tests pass

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59763
2016-06-16 16:21:41 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
Wanning Jiang
56887f6cb8 Backup Options
Summary: Backup options file to private directory

Test Plan:
backupable_db_test.cc, BackupOptions
	   Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D59373
2016-06-09 19:03:10 -07:00
Andrew Kryczka
842958651f Fix race condition in SwitchMemtable
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.

Related: D58833

Test Plan:
  $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
  $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59139
2016-06-02 17:11:45 -07:00
PraveenSinghRao
3a276b0cbe Add a callback for when memtable is moved to immutable (#1137)
* Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

moved notification outside the lock

Move sealed notification to unlocked portion of SwitchMemtable

* fix lite build
2016-06-02 11:57:31 -07:00
Mike Kolupaev
936973d145 Small tweaks to logging to track the number of immutable memtables
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.

Test Plan: ran tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58833
2016-06-01 11:11:33 -07:00
Reid Horuff
5d85fdb2c5 add missing lock 2016-05-31 12:26:48 -07:00
Ashish Shenoy
99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
Sage Weil
11f329bd40 db/db_impl: restrict WALRecoveryMode when using recycled log files
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.

kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
Islam AbdelRahman
c70a9335de Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles()
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().

Test Plan:
added unittest
existing unittest

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58065
2016-05-18 14:56:30 -07:00
Aaron Gao
43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
Islam AbdelRahman
f6aedb62c0 Fix Transaction memory leak
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest

Test Plan: Run transaction_test under valgrind

Reviewers: sdong, andrewkr, yhchiang, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58263
2016-05-16 16:32:55 -07:00
Reid Horuff
a400336398 TransactionLogIterator sequence gap fix
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.

Test Plan: db_log_iter_test

Reviewers: andrewkr

Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman

Differential Revision: https://reviews.facebook.net/D58053
2016-05-12 13:54:08 -07:00
Reid Horuff
c27061dae7 [rocksdb] 2PC double recovery bug fix
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost

This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.

The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.

Test Plan: Provided test.

Reviewers: sdong

Subscribers: andrewkr, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57729
2016-05-10 14:06:07 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Yi Wu
a92049e3e7 Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

Test Plan: unit test.

Reviewers: dhruba, yhchiang, ott, sdong

Reviewed By: sdong

Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 11:35:00 -07:00
sdong
992a8f83b7 Not enable jemalloc status printing if USE_CLANG=1
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.

Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57399
2016-04-28 16:16:14 -07:00
Islam AbdelRahman
0850bc5147 Fix build on machines without jemalloc
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing

Test Plan:
check on my devserver
check on my mac

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57345
2016-04-27 18:25:19 -07:00
Sergey Makarenko
1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
sdong
ac0e54b4c6 CompactedDB should not be used if there is outstanding WAL files
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.

Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57057
2016-04-26 14:22:07 -07:00
Yueh-Hsuan Chiang
24110ce90c Correct Statistics FLUSH_WRITE_BYTES
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction.  However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)

This patch removes the one incremented in db_impl.

Test Plan: db_test

Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57111
2016-04-25 12:01:01 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Jay Edgar
2448f80375 Make sure that if use_mmap_reads is on use_os_buffer is also on
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on.  This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs.  Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads

Test Plan: New test: DBTest.MMapAndBufferOptions

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56397
2016-04-08 14:30:15 -07:00
Andrew Kryczka
2391ef7214 Embed column family name in SST file
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.

In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).

Test Plan:
New unit test:

  $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55605
2016-04-06 23:10:32 -07:00
Marton Trencseni
9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
zensan
78711524b7 In all the places where log records are read, there was a check that
record.size() should not be less than 12.

This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count).   Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
2016-03-30 23:05:22 +05:30
Praveen Rao
583157f710 Avoid overloaded virtual function 2016-03-22 17:10:31 -07:00
Praveen Rao
4f1c74a46e merge from master 2016-03-18 12:48:01 -07:00
Praveen Rao
f8c2189307 Publish log numbers for column family to wal_filter, and provide log number in the record callback 2016-03-18 12:32:15 -07:00
Baris Yazici
e8e6cf0173 fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check | c++/4.8.2/bits/stl_vector.h:794 #174
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174

When there is no old files to purge, vector.at(i) function was crashing

if (old_info_log_file_count != 0 &&
      old_info_log_file_count >= db_options_.keep_log_file_num) {
    std::sort(old_info_log_files.begin(), old_info_log_files.end());
    size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
    for (unsigned int i = 0; i <= end; i++) {
      std::string& to_delete = old_info_log_files.at(i);

Added check to old_info_log_file_count be non zero.

Test Plan: run existing tests

Reviewers: gunnarku, vasilep, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, webscalesql-eng, dhruba

Differential Revision: https://reviews.facebook.net/D55245
2016-03-11 11:11:45 -08:00
Andrew Kryczka
d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
Yueh-Hsuan Chiang
765597fa78 Update compaction score right after CompactFiles forms a compaction
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet execute.

This patch adds a missing call to update the compaction score in
CompactFiles().

Test Plan: compact_files_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55227
2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang
a7d4eb2f34 Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:

 1. pause scheduling new bg work.
 2. wait until all bg work finished
 3. do the ReFitLevel()
 4. unpause scheduling new bg work.

However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)

This patch fix this issue by changing the way ReFitLevel() pause the background work:

1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.

The major difference is that.  We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2.  This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.

Test Plan: existing test.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
Islam AbdelRahman
dfe96c72c3 Fix WriteLevel0TableForRecovery file delete protection
Summary:
The call to

```
CaptureCurrentFileNumberInPendingOutputs()
```

should be before

```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted

Test Plan: make check

Reviewers: sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54645
2016-03-03 18:25:07 -08:00
sdong
e79ad9e184 Add Iterator Property rocksdb.iterator.version_number
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.

Test Plan: Add two unit tests for it.

Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54921
2016-03-02 16:23:59 -08:00
Islam AbdelRahman
6743135ea1 Fix DB::AddFile() issue when PurgeObsoleteFiles() is called
Summary:
In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete.
If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it.
This diff fix this issue

Test Plan:
unit test to reproduce the bug
existing unit tests

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54627
2016-03-01 12:05:29 -08:00
sdong
38201b3599 Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling
Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes.

Test Plan: Run the new tests.

Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54621
2016-02-23 11:42:15 -08:00
Mike Kolupaev
eef63ef807 Fixed CompactFiles() spuriously failing or corrupting DB
Summary:
We started getting two kinds of crashes since we started using `DB::CompactFiles()`:
(1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only,
(2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory".

AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this.

Test Plan: `make -j check`; the new test fails to reopen the DB without the fix

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, sdong

Differential Revision: https://reviews.facebook.net/D54561
2016-02-22 13:54:58 -08:00
Islam AbdelRahman
df9ba6df62 Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()

Test Plan: unit testing

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53763
2016-02-17 15:20:23 -08:00
reid horuff
5bcf952a87 Fix WriteImpl empty batch hanging issue
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.

Test Plan: DoubleEmptyBatch unit test in transaction_test.

Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54057
2016-02-16 12:21:33 -08:00
Mike Kolupaev
44371501f0 Fixed a segfault when compaction fails
Summary: We've hit it today.

Test Plan: `make -j check`; didn't reproduce the issue

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54219
2016-02-16 11:11:16 -08:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang
4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00