Commit Graph

2612 Commits

Author SHA1 Message Date
Aaron Gao
b39b2ee12f do not call get() in recovery mode
Summary:
This is a previous fix that has a typo
Closes https://github.com/facebook/rocksdb/pull/1487

Differential Revision: D4157381

Pulled By: lightmark

fbshipit-source-id: f079be8
2016-11-10 11:24:20 -08:00
Reid Horuff
1ca5f6d132 Fix 2PC Recovery SeqId Miscount
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486

Differential Revision: D4156064

Pulled By: reidHoruff

fbshipit-source-id: a6da8d9
2016-11-10 11:09:22 -08:00
Andrew Kryczka
c90fef88b1 fix open failure with empty wal
Summary: Closes https://github.com/facebook/rocksdb/pull/1490

Differential Revision: D4158821

Pulled By: IslamAbdelRahman

fbshipit-source-id: 59b73f4
2016-11-09 22:24:26 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
Yueh-Hsuan Chiang
a9fb346e4a Fix RocksDB Lite build failure in c_test.cc
Summary:
Fix the following RocksDB Lite build failure in c_test.cc

db/c_test.c:1051:3: error: implicit declaration of function 'fprintf' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  fprintf(stderr, "SKIPPED\n");
  ^
db/c_test.c:1051:3: error: declaration of built-in function 'fprintf' requires inclusion of the header <stdio.h> [-Werror,-Wbuiltin-requires-header]
db/c_test.c:1051:11: error: use of undeclared identifier 'stderr'
  fprintf(stderr, "SKIPPED\n");
          ^
3 errors generated.
Closes https://github.com/facebook/rocksdb/pull/1479

Differential Revision: D4151160

Pulled By: yhchiang

fbshipit-source-id: a471a30
2016-11-09 12:24:18 -08:00
Reid Horuff
d133b08f68 Use correct sequence number when creating memtable
Summary:
copied from: 5ebfd2623a

Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.

This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2.  Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"

This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458

Differential Revision: D4143791

Pulled By: reidHoruff

fbshipit-source-id: 5a25033
2016-11-09 12:24:17 -08:00
Islam AbdelRahman
9bd191d2f4 Fix deadlock between (WriterThread/Compaction/IngestExternalFile)
Summary:
A deadlock is possible if this happen

(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish

This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.

This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480

Differential Revision: D4151646

Pulled By: IslamAbdelRahman

fbshipit-source-id: 09b39db
2016-11-09 10:54:10 -08:00
Islam AbdelRahman
193221e0a1 Fix Forward Iterator Seek()/SeekToFirst()
Summary:
In ForwardIterator::SeekInternal(), we may end up passing empty Slice representing an internal key to InternalKeyComparator::Compare.
and when we try to extract the user key from this empty Slice, we will create a slice with size = 0 - 8 ( which will overflow and cause us to read invalid memory as well )

Scenarios to reproduce these issues are in the unit tests
Closes https://github.com/facebook/rocksdb/pull/1467

Differential Revision: D4136660

Pulled By: lightmark

fbshipit-source-id: 151e128
2016-11-08 13:54:31 -08:00
Aaron Gao
e48f3f8b9e remove tabs and duplicate #include in c api
Summary:
fix lint error about tabs and duplicate includes.
Closes https://github.com/facebook/rocksdb/pull/1476

Differential Revision: D4149646

Pulled By: lightmark

fbshipit-source-id: 2e0a632
2016-11-08 13:54:31 -08:00
Jay Lee
a7875272d7 c: support seek_for_prev
Summary:
support seek_for_prev in c abi.
Closes https://github.com/facebook/rocksdb/pull/1457

Differential Revision: D4135360

Pulled By: lightmark

fbshipit-source-id: 61256b0
2016-11-08 12:54:13 -08:00
Andrew Kryczka
9e7cf3469b DeleteRange user iterator support
Summary:
Note: reviewed in  https://reviews.facebook.net/D65115

- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464

Differential Revision: D4131753

Pulled By: ajkr

fbshipit-source-id: be86559
2016-11-04 12:09:22 -07:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
zhangjinpeng1987
879f366366 Add C api for RateLimiter
Summary:
Add C api for RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/1455

Differential Revision: D4116362

Pulled By: yiwu-arbug

fbshipit-source-id: cb05a8d
2016-11-03 11:09:17 -07:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard
2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Jay Lee
16fb04434f expose IngestExternalFile to c abi
Summary:
IngestExternalFile is very useful when doing bulk load. This pr expose this API to c so many bindings can benefit from it too.
Closes https://github.com/facebook/rocksdb/pull/1454

Differential Revision: D4113420

Pulled By: yiwu-arbug

fbshipit-source-id: 307c6ae
2016-11-01 17:09:39 -07:00
Andrew Kryczka
40a2e406f8 DeleteRange flush support
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438

Differential Revision: D4100025

Pulled By: ajkr

fbshipit-source-id: cb01a70
2016-10-31 20:54:18 -07:00
Vladislav Vaintroub
d5555d95a3 Fix MSVC compile error in 32 bit compilation
Summary:
Passing std::atomic<uint64_t> variables to ASSERT_EQ()
results in compile error
C2718 'const T1': actual parameter with requested alignment of 8 won't be aligned.

VS2015 defines std::atomic as specially aligned type ( with 'alignas'),
however the compiler does not like declspec(align)ed  function
arguments.

Worked around by casting std::atomic<uint64_t> types to uint64_t
in ASSERT_EQ.
Closes https://github.com/facebook/rocksdb/pull/1450

Differential Revision: D4106788

Pulled By: yiwu-arbug

fbshipit-source-id: 5fb42c3
2016-10-31 17:24:18 -07:00
Siying Dong
da61f348d3 Print compression and Fast CRC support info as Header level
Summary:
Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level.

Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448

Differential Revision: D4106608

Pulled By: yiwu-arbug

fbshipit-source-id: cb9a076
2016-10-31 16:09:13 -07:00
Siying Dong
c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Siying Dong
1b295ac8ae DBTest.GetThreadStatus: Wait for test results for longer
Summary:
The current 10 millisecond waiting for test results may not be sufficient in some test environments. Increase it to 60 seconds and check the results for every 1 milliseond.

Already reviewed: https://reviews.facebook.net/D65457
Closes https://github.com/facebook/rocksdb/pull/1437

Differential Revision: D4099443

Pulled By: siying

fbshipit-source-id: cf1f205
2016-10-29 16:09:18 -07:00
Aaron Gao
b50a81a2bb Add a test for tailing_iterator
Summary:
A bug that tailingIterator->Seek(target) skips records.

I think the bug is in the SeekInternal starting at lines 387:
search_left_bound > search_right_bound
There are only 2 cases this can happen:
(1) target key is smaller than left most file
(2) target key is larger than right most file

The comment is wrong, there is another possibility that at the higher level there is a big gap such that the file in the lower level fits completely in the gap and then
indexer->GetNextLevelIndex returns search_left_bound > search_right_bound I think pointing on the files after and before the gap.
details: https://github.com/facebook/rocksdb/issues/1372

fixed this bug with test case added.
Closes https://github.com/facebook/rocksdb/pull/1436

Reviewed By: IslamAbdelRahman

Differential Revision: D4099313

Pulled By: lightmark

fbshipit-source-id: 6a675b3
2016-10-28 18:24:14 -07:00
Siying Dong
04751d5345 L0 compression should follow options.compression_per_level if not empty
Summary:
Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior
 doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0].

Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435

Differential Revision: D4099368

Pulled By: siying

fbshipit-source-id: cfbbdcd
2016-10-28 17:39:20 -07:00
Andrew Kryczka
2946cadc46 Improve RangeDelAggregator documentation
Summary:
as requested in D62259
Closes https://github.com/facebook/rocksdb/pull/1434

Differential Revision: D4099047

Pulled By: ajkr

fbshipit-source-id: a258cfb
2016-10-28 15:54:21 -07:00
Aaron Gao
bc429de490 revert fractional cascading in farward iterator
Summary: As offline discussion with Siying, revert this since it has bug with seek.

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65559
2016-10-28 10:25:39 -07:00
Andrew Kryczka
b9bc7a2aa4 Use skiplist rep for range tombstone memtable
Summary: somehow missed committing this update in D62217

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65361
2016-10-27 10:07:28 -07:00
Siying Dong
9ee84067f6 Disable DBTest.RepeatedWritesToSameKey (#1420)
Summary:
The verification condition of the test DBTest.RepeatedWritesToSameKey doesn't hold anymore after 3ce3bb3da2.
Disable the test for now before we find a way to replace it.

Test Plan: Run the test and make sure it is disabled.
2016-10-25 10:23:50 -07:00
Aaron Gao
9de2f75216 revert Prev() in MergingIterator to use previous code in non-prefix-seek mode
Summary: Siying suggested to keep old code for normal mode prev() for safety

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65439
2016-10-24 13:13:01 -07:00
sdong
24495186da DBSSTTest.RateLimitedDelete: not to use real clock
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.

Test Plan: Run the tests and all existing tests.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65145
2016-10-24 10:35:00 -07:00
sdong
1168cb810a Fix a bug that may cause a deleted row to appear again
Summary:
The previous fix of reappearing of a deleted row 0ce258f9b3 missed a corner case, which can be reproduced using test CompactionPickerTest.OverlappingUserKeys7. Consider such an example:

input level file: 1[B E] 2[F H]
output level file: 3[A C] 4[D I] 5[I K]

First file 2 is picked, which overlaps to file 4. 4 expands to 5. Now the all range is [D K] with 2 output level files. When we try to expand that, [D K] overlaps with file 1 and 2 in the input level, and 1 and 2 overlaps with 3 and 4 in the output level. So we end up with picking 3 and 4 in the output level. Without expanding, it also has 2 files, so we determine the output level doesn't change, although they are the different two files.

The fix is to expand the output level files after we picked 3 and 4. In that case, there will be three output level files so we will abort the expanding.

I also added two unit tests related to marked_for_compaction and being_compacted. They have been passing though.

Test Plan: Run the new unit test, as well as all other tests.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65373
2016-10-24 09:49:07 -07:00
Edouard A
99c052a34f Fix integer overflow in GetL0ThresholdSpeedupCompaction (#1378) 2016-10-23 18:43:29 -07:00
Aaron Gao
59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman
869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
sdong
1d9dbef64e Restrict running condition of UniversalCompactionTrivialMoveTest2
Summary: DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 verifies non-trivial move is not triggered if we load data in sequential order. However, if there are multiple compaction threads, this conditon may not hold. Restrict the running condition to 1 compaction thread to make the test more robust.

Test Plan: Run the test and make sure at least it doesn't regress normally.

Reviewers: yhchiang, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65277
2016-10-20 15:43:00 -07:00
sdong
fb2e412943 column_family_test: disable some tests in LITE
Summary: Some tests in column_family_test depend on functions that are not available in LITE build, which sometimes cause flakiness. Disable them.

Test Plan: Run those tests in LITE build.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65271
2016-10-19 15:55:56 -07:00
Aaron Gao
5af651db24 fix data race in compact_files_test
Summary: fix data race

Test Plan: compact_files_test

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65259
2016-10-19 13:37:51 -07:00
Andrew Kryczka
a0ba0aa877 Fix uninitialized variable gcc error for MyRocks
Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails.

Test Plan: run myrocks release tests

Reviewers: lightmark, mung, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65199
2016-10-19 10:59:46 -07:00
Islam AbdelRahman
b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Aaron Gao
52c9808c3a not split file in compaciton on level 0
Summary: we should not split file on level 0 in compaction because it will fail the following verification of seqno order on level 0

Test Plan: check with filldeterministic in db_bench

Reviewers: yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65193
2016-10-18 16:30:34 -07:00
Aaron Gao
5e0d6b4cc9 fix db_stress assertion failure
Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted

Test Plan: db_iter_test

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65187
2016-10-18 16:07:10 -07:00
sdong
b4d07123c4 SamePrefixTest.InDomainTest to clear the test directory before testing
Summary: SamePrefixTest.InDomainTest may fail if the previous run of some test cases in prefix_test fail.

Test Plan: Run the test

Reviewers: lightmark, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65163
2016-10-18 14:01:10 -07:00
Islam AbdelRahman
aa09d03381 Avoid calling GetDBOptions() inside GetFromBatchAndDB()
Summary:
MyRocks hit a regression, @mung generated perf reports showing that the reason is the cost of calling `GetDBOptions()` inside `GetFromBatchAndDB()`
This diff avoid calling `GetDBOptions` and use the `ImmutableDBOptions` instead

Test Plan: make check -j64

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, mung

Differential Revision: https://reviews.facebook.net/D65151
2016-10-18 13:19:26 -07:00
Andrew Kryczka
6fbe96baf8 Compaction Support for Range Deletion
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.

For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.

To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.

RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.

One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.

Depends on D61473

Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927

Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62205
2016-10-18 12:04:56 -07:00
Andrew Kryczka
f47054015d Handle WAL deletion when using avoid_flush_during_recovery
Summary:
Previously the WAL files that were avoided during recovery would never
be considered for deletion. That was because alive_log_files_ was only
populated when log files are created. This diff further populates
alive_log_files_ with existing log files that aren't flushed during recovery,
such that FindObsoleteFiles() can find them later.

Depends on D64053.

Test Plan: new unit test, verifies it fails before this change and passes after

Reviewers: sdong, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64059
2016-10-14 12:59:51 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Aaron Gao
21e8daced5 fix assertion failure in Prev()
Summary:
fix assertion failure in db_stress.
It happens because of prefix seek key is larger than merge iterator key when they have the same user key

Test Plan: ./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=0 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=0 --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=4 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=888887 --cache_size=1048576 --verify_checksum=1

Reviewers: sdong, andrewkr, yiwu, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65025
2016-10-13 17:36:48 -07:00
Yueh-Hsuan Chiang
040328a30d Remove an assertion for single-delete in MergeHelper::MergeUntil
Summary:
Previously we have an assertion which triggers when we issue Merges
after a single delete.  However, merges after a single delete are
unrelated to that single delete.  Thus this behavior should be
allowed.

This will address a flakyness of db_stress.

Test Plan: db_stress

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64923
2016-10-13 14:26:57 -07:00
Islam AbdelRahman
f26a139d89 Log successful AddFile
Summary: Log successful AddFile

Test Plan: visually check LOG file

Reviewers: yiwu, andrewkr, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D65019
2016-10-13 11:56:27 -07:00
Islam AbdelRahman
5691a1d8a4 Fix compaction conflict with running compaction
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency

Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D64947
2016-10-13 10:49:06 -07:00
Andrew Kryczka
017de666c7 fixup commit
Summary: I accidentally left out these changes from my commit of D64053 due to
messing up the merge conflict resolution.

Test Plan: ./db_wal_test

Reviewers:

Subscribers:

Tasks:

Blame Revision: D64053
2016-10-13 08:48:40 -07:00
Andrew Kryczka
1b7af5fb1a Redo handling of recycled logs in full purge
Summary:
This reverts commit 9e4aa798c3,
which doesn't handle all cases (see inline comment).

I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313.

This approach has two benefits:

- All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles.
- We only need to check whether log file is recycled in one place where we've already determined it's a log file

Test Plan:
new unit test, verified fails before the original fix, still passes
now.

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: yiwu, sdong

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64053
2016-10-12 23:13:09 -07:00
Aaron Gao
447f17127c new Prev() prefix support using SeekForPrev()
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.

Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.

2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.

add test cases for the above two cases.

There are some tests for the SeekToLast() in Prev(), I will clean them later.

Test Plan: make all check

Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63933
2016-10-11 13:54:26 -07:00
Islam AbdelRahman
2ad68b971a Support running consistency checks in release mode
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode

Test Plan:
make check -j64
make release

Reviewers: lightmark, sdong, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64701
2016-10-07 17:21:45 -07:00
Islam AbdelRahman
d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Reid Horuff
2c1f95291d Add facility to write only a portion of WriteBatch to WAL
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.

```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628

RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```

Test Plan: Two unit tests added.

Reviewers: sdong, yiwu, IslamAbdelRahman

Reviewed By: yiwu

Subscribers: hermanlee4, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64599
2016-10-07 11:32:10 -07:00
Islam AbdelRahman
ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
Andrew Kryczka
6009c473c7 Store range tombstones in memtable
Summary:
- Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory
- MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep
- Part of the read path is not implemented yet (i.e., MemTable::Get())

Test Plan: see unit tests

Reviewers: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62217
2016-09-30 09:06:43 -07:00
Aaron Gao
26388247aa delete unused variable for PrevInterval()
Summary: delete unused variable

Test Plan: make check

Reviewers: sdong, andrewkr, IslamAbdelRahman, tianx

Reviewed By: tianx

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64509
2016-09-29 13:19:58 -07:00
Islam AbdelRahman
87dfc1d23e Fix conflict between AddFile() and CompactRange()
Summary:
Fix the conflict bug between AddFile() and CompactRange() by
- Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction
- If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it

This will solve these 2 scenarios
- If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction
- If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64449
2016-09-28 15:42:06 -07:00
Aaron Gao
f517d9dd09 Add SeekForPrev() to Iterator
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()

Also add tests in db_iter_test and db_iterator_test

Pass all tests
Cheers!

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64149
2016-09-27 18:20:57 -07:00
yiwu-arbug
eb3894cf42 Recompute compaction score on SetOptions (#1346)
Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it.

Test Plan: See unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64167
2016-09-27 11:17:15 -07:00
Islam AbdelRahman
5c64fb67d2 Fix AddFile() conflict with compaction output [WaitForAddFile()]
Summary:
Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file

this diff add
- WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now
- Call `WaitForAddFile()` in 3 locations
-- When doing manual Compaction
-- When starting automatic Compaction
-- When  doing CompactFiles()

Test Plan: unit test

Reviewers: lightmark, yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64383
2016-09-27 00:14:55 -07:00
Islam AbdelRahman
9e9f5a0b92 Fix CompactFilesTest.ObsoleteFiles timeout (#1353) 2016-09-26 10:39:07 -07:00
Aaron Gao
c2a62a4cb2 not cut compaction output when compact to level 0
Summary: we should not call ShouldStopBefore() in compaction when the compaction targets level 0. Otherwise, CheckConsistency will fail the assertion of seq number check on level 0.

Test Plan:
make all check -j64
I also manully test that using db_bench to compact files to level 0. Without this line change, the assertion files and multiple files are generated on level 0 after compaction.

Reviewers: yhchiang, andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64269
2016-09-23 17:16:38 -07:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
yiwu-arbug
4bc8c88e6b Recover same sequence id from WAL (#1350)
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.

This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.

Test Plan: Added unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64275
2016-09-23 16:15:14 -07:00
Aaron Gao
0a1bd9c509 add cfh deletion started listener
Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler.

Test Plan: ./listener_test

Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60717
2016-09-22 11:56:18 -07:00
Islam AbdelRahman
da5a9a65c1 Fix mac build 2016-09-21 20:22:09 -07:00
Islam AbdelRahman
abc0ae462b Add AddFile() InternalStats for Total files/L0 files/total keys ingested
Summary:
Report more information about the ingested files in CF InternalStats
- Total files
- Total L0 files
- Total keys

There was also noticed that we were reporting files that failed to ingest, fix this bug

Test Plan: print stats in tests

Reviewers: sdong, andrewkr, lightmark

Reviewed By: lightmark

Subscribers: jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63039
2016-09-21 14:24:08 -07:00
Aaron Gao
715256338a forbid merge during recovery
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625

Test Plan: make all check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62655
2016-09-21 11:05:07 -07:00
Yi Wu
e4d3f5d9b8 Fix DBImpl::GetWalPreallocateBlockSize Mac build error
Summary: Specify type param with std::min to resolve compile error on Mac.

Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64143
2016-09-20 10:17:28 -07:00
sdong
d78a4401b5 DBImpl::GetWalPreallocateBlockSize() should return size_t
Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t.

Test Plan: Run existing tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64137
2016-09-19 16:51:38 -07:00
sdong
b666f85445 Consider more factors when determining preallocation size of WAL files
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.

Test Plan: Add a unit test.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63957
2016-09-19 12:04:35 -07:00
rockeet
4c3f4496b5 Add TableBuilderOptions::level and relevant changes (#1335) 2016-09-17 22:30:43 -07:00
Yi Wu
0a88f38b7e Remove ColumnFamilyData::options()
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64047
2016-09-16 15:09:14 -07:00
Yi Wu
8d9bf5c498 Fix DBOptionsTest.GetLatestOptions
Summary: RandomInitCFOptions will allocate a new compaction filter, which we have to delete afterward.

Test Plan: valgrind against the test

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64011
2016-09-15 14:57:32 -07:00
Yi Wu
40cfa3e021 Fix DBWALTest.RecoveryWithLogDataForSomeCFs with mac
Summary: Seems there's no std::array on mac+clang. Use raw array instead.

Test Plan: run ./db_wal_test on mac.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64005
2016-09-15 13:44:33 -07:00
Andrew Kryczka
06b4785fec Fix recovery for WALs without data for all CFs
Summary:
if one or more CFs had no data in the WAL, the log number that's used
by FindObsoleteFiles() wasn't updated. We need to treat this case the same as
if the data for that WAL had been flushed.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63963
2016-09-15 11:40:48 -07:00
Andrew Kryczka
d7242ff4d5 Fix GetSortedWalFiles when log recycling enabled
Summary:
Previously the sequence number was mistakenly passed in an argument
where the log number should go. This caused the reader to assume the old WAL
format was used, which is incompatible with the WAL recycling format.

Test Plan:
new unit test, verified it fails before this change and passes
afterwards.

Reviewers: yiwu, lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63987
2016-09-15 09:55:02 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu
81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
somnathr
9e4aa798c3 Summary: (#1313)
If log recycling is enabled with the rocksdb (recycle_log_file_num=16)
 db->Writebatch is erroring out with keynotfound after ~5-6 hours of run
 (1M seq but can happen to any workload I guess).See my detailed bug
 report here (https://github.com/facebook/rocksdb/issues/1303).
 This commit is the fix for this, a check is been added not to delete
 the log file if it is already there in the recycle list.

Test Plan:
 Unit tested it and ran the similar profile. Not reproducing anymore.
2016-09-12 16:53:42 -07:00
Adam Faulkner
a10e8a056d Fix C api memtable rep bugs. (#1328) 2016-09-12 15:31:42 -07:00
zhangjinpeng1987
b06b191362 add C api for set wal_recovery_mode (#1327)
* add C api for set wal recovery mode

* add test
2016-09-09 10:11:30 -07:00
Islam AbdelRahman
1cca091298 Temporarily revert Prev() prefix support
Summary:
Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release

These are the commits reverted

  - 6a14d55bd9
  - b18f9c9eac
  - db74b1a219
  - 2482d5fb45

Test Plan: make check -j64

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63789
2016-09-08 14:45:32 -07:00
Islam AbdelRahman
52ee07b021 Move AddFile() tests to external_sst_file_test.cc
Summary: Simply move the tests

Test Plan: make check -j64

Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62529
2016-09-07 15:41:54 -07:00
Edouard A
66a91e2607 Add NoSpace subcode to IOError (#1320)
Add a sub code to distinguish "out of space" errors from regular I/O errors
2016-09-07 12:37:45 -07:00
sdong
67036c0406 Fix Flaky ColumnFamilyTest.FlushCloseWALFiles
Summary: In ColumnFamilyTest.FlushCloseWALFiles, there is a small window in which the flush has finished but the log writer is not yet closed, causing the assert failure. Fix it by explicitly waiting the flush job to finish.

Test Plan: Run the test many times in high parallelism.

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63423
2016-09-07 11:19:15 -07:00
sdong
607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
Injun Song
ce1be2ce37 Fix build error on Windows (AppVeyor) (#1315)
Add 'cf_options' to source list and db_imple.cc

fix casting
2016-09-06 08:41:43 -07:00
sdong
22696b0881 Fix uninitlized CompactionJob::SubcompactionState::current_output_file_size
Summary: The new variable introduced in 2149059f910149197d1a0f79ac08cf19465ea2d may be unitialized. Valgrind is failing because of it.

Test Plan: Run valgrind tests

Reviewers: yiwu, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63201
2016-09-02 17:06:20 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
Islam AbdelRahman
80c75593ed Fix data race in AddFile() with multiple files + custom comparator bug
Summary:
When ingesting multiple files
- We should use user comparator
- Should not call `cfd->current()` outside of mutex

Test Plan: unit tests

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63075
2016-09-02 11:17:58 -07:00
John Alexander
4fd08f4b8b Ensure Correct Behavior of StatsLevel kExceptDetailedTimers and kExceptTimeForMutex (#1308)
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).

* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
2016-09-01 19:57:55 -07:00
sdong
32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Aaron Gao
6a14d55bd9 add prefix_seek_mode to db_iter_test
Summary: add prefix_seek_mode to db_iter_test to enable data race test for iterator when prefix_extractor != nullptr

Test Plan: make all check -j64

Reviewers: andrewkr, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63027
2016-08-31 12:07:09 -07:00
Aaron Gao
2482d5fb45 support Prev() in prefix seek mode
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode

Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61419
2016-08-29 20:55:39 -07:00
Islam AbdelRahman
b49b92cf28 Introduce Read amplification bitmap (read amp statistics)
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.

We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification

Test Plan: added new tests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: yiwu, leveldb, march, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58707
2016-08-26 18:55:58 -07:00
sdong
dade61ac26 Mitigate regression bug of options.max_successive_merges hit during DB Recovery
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.

This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.

Test Plan: Add a new test and run it.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62625
2016-08-25 17:30:34 -07:00
Justin Gibbs
b2ce59537c Persist data during user initiated shutdown
Summary:
Move the manual memtable flush for databases containing data that has
bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork().

CancelAllBackgroundWork() is a publicly exposed API which allows
async operations performed by background threads to be disabled on a
database. In effect, this places the database into a "shutdown" state
in advance of calling the database object's destructor. No compactions
or flushing of SST files can occur once a call to this API completes.

When writes are issued to a database with WriteOptions::disableWAL
set to true, DBImpl::has_unpersisted_data_ is set so that
memtables can be flushed when the database object is destroyed. If
CancelAllBackgroundWork() has been called prior to DBImpl's destructor,
this flush operation is not possible and is skipped, causing unnecessary
loss of data.

Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor
in order to perform the thread join portion of its cleanup processing,
moving the manual memtable flush to CancelAllBackgroundWork() ensures
data is persisted regardless of client behavior.

Test Plan:
Write an amount of data that will not cause a memtable flush to a rocksdb
database with all writes marked with WriteOptions::disableWAL. Properly
"close" the database. Reopen database and verify that the data was
persisted.

Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62277
2016-08-25 12:24:22 -07:00
Mike Kolupaev
7b81095171 Fix a crash when compaction fails to open a file
Summary:
We've got a crash with this stack trace:

  Program terminated with signal SIGTRAP, Trace/breakpoint trap.

  #0  0x00007fc85f2f4009 in raise () from /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0
  #1  0x00000000005c8f61 in facebook::logdevice::handle_sigsegv(int) () at logdevice/server/sigsegv.cpp:159
  #2  0x00007fc85f2f4150 in <signal handler called> () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0
  #3  0x00000000031ed80c in rocksdb::NewReadaheadRandomAccessFile() at util/file_reader_writer.cc:383
  #4  0x00000000031ed80c in rocksdb::NewReadaheadRandomAccessFile() at util/file_reader_writer.cc:472
  #5  0x00000000031558e7 in rocksdb::TableCache::GetTableReader() at db/table_cache.cc:99
  #6  0x0000000003156329 in rocksdb::TableCache::NewIterator() at db/table_cache.cc:198
  #7  0x0000000003166568 in rocksdb::VersionSet::MakeInputIterator() at db/version_set.cc:3345
  #8  0x000000000324a94f in rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*) () at db/compaction_job.cc:650
  #9  0x000000000324c2f6 in rocksdb::CompactionJob::Run() () at db/compaction_job.cc:530
  #10 0x00000000030f5ae5 in rocksdb::DBImpl::BackgroundCompaction() at db/db_impl.cc:3269
  #11 0x0000000003108d36 in rocksdb::DBImpl::BackgroundCallCompaction(void*) () at db/db_impl.cc:2970
  #12 0x00000000029a2a9a in facebook::logdevice::RocksDBEnv::callback(void*) () at logdevice/server/locallogstore/RocksDBEnv.cpp:26
  #13 0x00000000029a2a9a in facebook::logdevice::RocksDBEnv::callback(void*) () at logdevice/server/locallogstore/RocksDBEnv.cpp:30
  #14 0x00000000031e7521 in rocksdb::ThreadPool::BGThread() at util/threadpool.cc:230
  #15 0x00000000031e7663 in rocksdb::BGThreadWrapper(void*) () at util/threadpool.cc:254
  #16 0x00007fc85f2ea7f1 in start_thread () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0
  #17 0x00007fc85e8fb46d in clone () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libc.so.6

From looking at the code, probably what happened is this:
 - `TableCache::GetTableReader()` called `Env::NewRandomAccessFile()`, which dispatched to a `PosixEnv::NewRandomAccessFile()`, where probably an `open()` call failed, so the `NewRandomAccessFile()` left a nullptr in the resulting file,
 - `TableCache::GetTableReader()` called `NewReadaheadRandomAccessFile()` with that `nullptr` file,
 - it tried to call file's method and crashed.

This diff is a trivial fix to this crash.

Test Plan: `make -j check`

Reviewers: sdong, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62451
2016-08-25 04:39:26 -07:00
Islam AbdelRahman
2a9c97108e [Flaky Test] Disable DBPropertiesTest.GetProperty
Summary: Disable flaky test

Test Plan: run it

Reviewers: yiwu, andrewkr, kradhakrishnan, yhchiang, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62487
2016-08-24 15:32:01 -07:00
Yi Wu
badbff65b7 Not insert into block cache if cache is full and not holding handle
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.

Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014

  db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.

The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62325
2016-08-23 13:53:49 -07:00
Yi Wu
4a16c32ece Option to cache index/filter blocks with priority
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.

Depends on D61977.

Test Plan: See unit test.

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D62241
2016-08-23 13:44:13 -07:00
Islam AbdelRahman
6a17b07ca8 Add TablePropertiesCollector support in SstFileWriter
Summary: Update SstFileWriter to use user TablePropertiesCollectors that are passed in Options

Test Plan: unittests

Reviewers: sdong

Reviewed By: sdong

Subscribers: jkedgar, andrewkr, hermanlee4, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D62253
2016-08-19 16:17:56 -07:00
Wanning Jiang
78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
Jay
49d88be021 c abi: allow compaction filter ignore snapshot (#1268)
close #1262
2016-08-17 18:48:43 -07:00
Anirban Rahut
2fc2fd92a9 Single Delete Mismatch and Fallthrough statistics
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).

Test Plan: Tested the statistics using write_stress and compaction_job_stats_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61749
2016-08-16 08:21:43 -07:00
Andrew Kryczka
3771e37970 WriteBatch support for range deletion
Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".

With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.

Test Plan: unit tests

Reviewers: IslamAbdelRahman, sdong, wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61401
2016-08-16 08:16:04 -07:00
Islam AbdelRahman
64a0082c69 Fix DBSSTest::AddExternalSstFileSkipSnapshot valgrind fail
Summary: Fix the test by releasing the last snapshot

Test Plan: run the test under valgrind

Reviewers: andrewkr, yiwu, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62091
2016-08-15 14:04:40 -07:00
sdong
6525ce4caf Compaction stats printing: "batch" => "commit group"
Summary: "Batch" is ambiguous in this context. It can mean "write batch" or commit group. Change it to commit group to be clear.

Test Plan: Build

Reviewers: MarkCallaghan, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62055
2016-08-15 10:47:29 -07:00
Islam AbdelRahman
a297643f2e Fix valgrind memory leak 2016-08-11 23:34:19 -07:00
Islam AbdelRahman
d11c09d9e2 Eliminate memcpy from ForwardIterator
Summary:
This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators
This diff is stacked on D61305

Test Plan:
existing tests (updated them to test tailing iterator)
new test

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60009
2016-08-11 19:10:16 -07:00
Islam AbdelRahman
b693ba68b5 Minor PinnedIteratorsManager Refactoring
Summary:
This diff include these simple change
- Rename ReleasePinnedIterators to ReleasePinnedData
- Rename PinIteratorIfNeeded to PinIterator
- Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector>
- Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer

Test Plan: existing tests

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61305
2016-08-11 11:54:17 -07:00
sdong
4beffe001d Fix test data race in two FaultInjectionTest tests
Summary: Background sleeping tasks may conflict with test cleaning up. Wait for the sleeping tasks to finish before ending the test.

Test Plan: Run these tests.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61827
2016-08-10 13:56:50 -07:00
sdong
56dd034115 read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.

Test Plan: I am adding unit tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61503
2016-08-10 13:16:41 -07:00
Islam AbdelRahman
ccecf3f4fb UniversalCompaction should ignore sorted runs being compacted (when compacting for file num)
Summary:
If we have total number of sorted runs greater than level0_file_num_compaction_trigger, Universal compaction will always issue a compaction
even if the number of sorted runs that are not being compacted is less than level0_file_num_compaction_trigger.

This diff changes this behaviour to relay on the `number of sorted runs not being compacted` instead of `total number of sorted runs`

Test Plan: New unit test

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61533
2016-08-10 12:37:43 -07:00
Zongzhi Chen
98d0b78eac Added check_snapshot option in the DB's AddFile function (#1261)
* Added check_snapshot option in the DB's AddFile function

* change check_snapshot to skip_snapshot_check

* add unit test for skip_snapshot_check

* Add skip_snapshot_check comment
2016-08-09 18:14:13 -07:00
Yi Wu
7882cb9773 Make DBOptionsTest::EnableAutoCompactionAndTriggerStall less falky
Summary: Explicitly flush two times to generate two sst files.

Test Plan: run the test.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61671
2016-08-05 16:45:57 -07:00
omegaga
c3a4bea5dc Fix flaky test ObsoleteFiles
Summary: The test `ObsoleteFiles` failed occasionally on slow device. This problem appears on Travis CI several times. The reason is that we did not wait until compaction jobs are finished in the test, while in slower device the background jobs take longer time to finish.

Test Plan: Pass existing tests.

Reviewers: yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61479
2016-08-03 15:19:35 -07:00
Yi Wu
ee027fc19f Ignore write stall triggers when auto-compaction is disabled
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.

Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.

Test Plan: See the new test. Removed two tests that are no longer valid.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61437
2016-08-02 21:55:26 -07:00
Aaron Gao
343304e1d3 Use StopWatch to do statistic job in db_impl_add_file.cc
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.

Test Plan: make all check -j64

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61239
2016-08-02 14:53:29 -07:00
Jay Edgar
cdc4eb6892 Add a GetComparator() function to the ColumnFamilyHandle base class so that the user's comparator can be retrieved.
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator.  It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.

Test Plan: Standard tests (adding one for the new method)

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61155
2016-08-02 14:34:57 -07:00
Islam AbdelRahman
5e2c796589 Make DBTest.CompressionStatsTest more deterministic
Summary:
DBTest.CompressionStatsTest on non_shm test where the storage device is slow

DBTest.CompressionStatsTest assumes that a flush happens to check the number of compressed blocks.
This is not always true if the Flush is slow, make the test more deterministic by forcing a flush before doing the check

Test Plan: Run the test locally

Reviewers: andrewkr, yiwu, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61317
2016-07-29 11:42:28 -07:00
Aaron Gao
e72ea485ed add InDomain regression test
Summary: regression tests to make sure seek keys not in domain would not fail assertion

Test Plan:
```
[gzh@dev6163.prn2 ~/local/rocksdb] ./prefix_test --gtest_filter=SamePrefixTest.*
/tmp/rocksdbtest-112628/prefix_test
Note: Google Test filter = SamePrefixTest.*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SamePrefixTest
[ RUN      ] SamePrefixTest.InDomainTest
[       OK ] SamePrefixTest.InDomainTest (211 ms)
[----------] 1 test from SamePrefixTest (211 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (211 ms total)
```

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61161
2016-07-27 18:45:53 -07:00
sdong
e5b5f12b81 Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60513
2016-07-26 18:15:11 -07:00
sdong
0ce258f9b3 Compaction picker to expand output level files for keys cross files' boundary too.
Summary: We may wrongly drop delete operation if we pick a file with the entry to be delete, the put entry of the same user key is in the next file in the level, and the next file is not picked. We expand compaction inputs for output level too.

Test Plan: Add unit tests that reproduct the bug of dropping delete entry. Change compaction_picker_test to assert the new behavior.

Reviewers: IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61173
2016-07-26 17:56:36 -07:00
Wanning Jiang
e12270dfee fix previous typo
Summary: old typos with FILTER/INDEX_CACHE

Test Plan: still pass this unit test

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61185
2016-07-26 11:15:14 -07:00
Islam AbdelRahman
16e225f70d Fix MergeContext::copied_operands_ strings moving
Summary:
MergeContext::copied_operands contain strings that MergeContext::operand_list_ Slices point to
It's possible that when MergeContext::copied_operands grow, these strings are moved and there place in memory is changed, this will cause MergeContext::operand_list_ to point to invalid memory.
fix this problem by using unique_ptr<string> instead of string

Test Plan: run tests under mac/clang

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61023
2016-07-25 15:31:41 -07:00
Yi Wu
ae0ad719de Fix flaky DBSSTTEST::DeleteObsoleteFilesPendingOutputs
Summary: The test is flaky on Travis in osx environment. The background flush the test wanting to block can run behind the L2 manual compaction, making the test actually blocking the L2 compaction and won't able to proceed.

Test Plan: Test run on travis

Reviewers: kradhakrishnan, sdong, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61101
2016-07-25 15:09:34 -07:00
Yi Wu
c6654588bd Disable two dynamic options tests under lite build
Summary: RocksDB lite don't support dynamic options. Disable the two test from lite build, and assert `SetOptions` should return `status::OK`.

Test Plan: Run the db_options test under lite build and normal build.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61119
2016-07-25 11:48:17 -07:00
sdong
2a6d0cde72 Ignore stale logs while restarting DBs
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.

This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.

In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.

Test Plan: Add two unit tests that fail without the fix.

Reviewers: IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60891
2016-07-25 11:47:31 -07:00
sdong
d5a51d4de3 Need to make sure log file synced before flushing memtable of one column family
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.

Test Plan: Add a new unit test which will fail without the diff.

Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60915
2016-07-21 16:29:06 -07:00
Yi Wu
89f319c2df Fix unit test which breaks lite build
Summary: Comment out assertion of number of table files from lite build.

Test Plan:
    OPT=-DROCKSDB_LITE make check

Reviewers: lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60999
2016-07-21 14:32:12 -07:00
Yi Wu
32604e6601 Fix flush not being commit while writing manifest
Summary:
Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075.

The issue:
# Options.max_background_flushes > 1
# Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.)
# Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit.
# After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted.

Test Plan: run the test. Also verify the new test hit deadlock without the fix.

Reviewers: sdong, igor, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, omegaga, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60969
2016-07-21 10:10:41 -07:00
John Alexander
9ab38c45ad Remove %z Format Specifier and Fix Windows Build of sim_cache.cc (#1224)
* Replace %zu format specifier with Windows-compatible macro 'ROCKSDB_PRIszt'

* Added "port/port.h" include to sim_cache.cc for call to snprintf().

* Applied cleaner fix to windows build, reverting part of 7bedd94
2016-07-20 15:28:04 -07:00
omegaga
e70020e4f6 Only cache level 0 indexes and filter when opening table reader
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).

Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59913
2016-07-20 11:23:31 -07:00
Islam AbdelRahman
68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
sdong
e70ba4e40e MemTable::PostProcess() can skip updating num_deletes if the delta is 0
Summary: In many use cases there is no deletes. No need to pay the overhead of atomically updating num_deletes.

Test Plan: Run existing test.

Reviewers: ngbronson, yiwu, andrewkr, igor

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60555
2016-07-19 18:10:18 -07:00
sdong
2a282e5f54 DBTablePropertiesTest.GetPropertiesOfTablesInRange: Fix Flaky
Summary:
Summary
There is a possibility that there is no L0 file after writing the data. Generate an L0 file to make it work.

Test Plan: Run the test many times.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60825
2016-07-19 15:46:20 -07:00
John Alexander
9430333f84 New Statistics to track Compression/Decompression (#1197)
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).

* Prevent incrementing compression counter when compression is turned off in options.

* Prevent incrementing compression counter when compression is turned off in options.

* Added two more supported compression types to test code in db_test.cc

* Prevent incrementing compression counter when compression is turned off in options.

* Added new StatsLevel that excludes compression timing.

* Fixed casting error in coding.h

* Fixed CompressionStatsTest for new StatsLevel.

* Removed unused variable that was breaking the Linux build
2016-07-19 09:44:03 -07:00
sdong
21c55bdb6e DBTest.DynamicLevelCompressionPerLevel: Tune Threshold
Summary: Each SST's file size increases after we add more table properties. Threshold in DBTest.DynamicLevelCompressionPerLevel need to adjust accordingly to avoid occasional failures.

Test Plan: Run the test

Reviewers: andrewkr, yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60819
2016-07-15 16:10:09 -07:00
sdong
6797e6ffac Avoid updating memtable allocated bytes if write_buffer_size is not set
Summary: If options.write_buffer_size is not set, nor options.write_buffer_manager, no need to update the bytes allocated counter in MemTableAllocator, which is expensive in parallel memtable insert case. Remove it can improve parallel memtable insert throughput by 10% with write batch size 128.

Test Plan:
Run benchmarks
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128   -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield

The throughput grows 10% with the benchmark.

Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson

Reviewed By: ngbronson

Subscribers: ngbronson, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60465
2016-07-13 19:33:57 -07:00
Aaron Gao
dda6c72ac8 Add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to db.h
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60765
2016-07-13 17:59:25 -07:00
Andrew Kryczka
56222f57df Avoid FileMetaData copy
Summary: as titled

Test Plan: unit tests

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60597
2016-07-13 15:36:22 -07:00
Yi Wu
6ea41f8527 Fix deadlock when trying update options when write stalls
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.

Test Plan:
Run unit tests.

Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60627
2016-07-12 15:30:38 -07:00
Jay Edgar
efd013d6d8 Miscellaneous performance improvements
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:

- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time.  Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed.  I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added.  I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key).  It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append().  Two or three calls in a row could share a buffer and a single call to std::string::append().

Some of these will be helpful outside of the SstFileWriter.  I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility?  I tried it on gcc 4.8 and 4.9.

Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added.  The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59607
2016-07-12 14:15:32 -07:00
Aaron Gao
816ae098ea fix test failure
Summary: fix Rocksdb Unit Test USER_FAILURE

Test Plan: make all check -j64

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60603
2016-07-11 13:33:52 -07:00
Aaron Gao
8e6b38d895 update DB::AddFile to ingest list of sst files
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).

Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58587
2016-07-11 10:43:12 -07:00
Yi Wu
296545a2c7 Fix clang analyzer errors
Summary:
Fixing erros reported by clang static analyzer.
* Removing some unused variables.
* Adding assertions to fix false positives reported by clang analyzer.
* Adding `__clang_analyzer__` macro to suppress false positive warnings.

Test Plan:
    USE_CLANG=1 OPT=-g make analyze -j64

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60549
2016-07-08 17:50:51 -07:00