Commit Graph

1950 Commits

Author SHA1 Message Date
Igor Canadi
95ffc5d2bc Correct ASSERT_OK() in ReadDroppedColumnFamily
Summary: ReadDroppedColumnFamily is consistently failing in Travis CI environment (can't repro locally). I suspect it might be failing with non-OK status. This diff will give us more info about the failure.

Test Plan: none

Reviewers: sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46611
2015-09-10 14:17:12 -07:00
Ari Ekmekji
3c37b3cccd Determine boundaries of subcompactions
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.

The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.

These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.

Test Plan: make all && make check

Reviewers: yhchiang, anthony, igor, noetzli, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43269
2015-09-10 13:50:00 -07:00
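The grouping step described in the subcompaction commit above (3c37b3cccd) can be illustrated with a small sketch. This is not the code in CompactionJob::GetSubcompactionBoundaries(); the helper name `GroupRanges` and the greedy policy are assumptions made for illustration. Given the approximate size of each consecutive key range, it closes a group whenever the running total reaches an equal share of the total size, and never produces more than `max_groups` groups.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not RocksDB API): returns the index of the last range
// in each group, producing at most max_groups groups of roughly equal size.
std::vector<std::size_t> GroupRanges(
    const std::vector<std::uint64_t>& range_sizes, std::size_t max_groups) {
  assert(max_groups > 0);
  std::vector<std::size_t> group_ends;
  if (range_sizes.empty()) return group_ends;

  const std::uint64_t total = std::accumulate(
      range_sizes.begin(), range_sizes.end(), std::uint64_t{0});
  // Equal share per group, rounded up.
  const std::uint64_t target = (total + max_groups - 1) / max_groups;

  std::uint64_t current = 0;
  for (std::size_t i = 0; i < range_sizes.size(); ++i) {
    current += range_sizes[i];
    const bool last_range = (i + 1 == range_sizes.size());
    // Close a group once it reaches the target, but always leave one group
    // available for the tail; the final range always closes the last group.
    if (last_range ||
        (current >= target && group_ends.size() + 1 < max_groups)) {
      group_ends.push_back(i);
      current = 0;
    }
  }
  return group_ends;
}
```

With four ranges of equal size and a limit of two groups, the sketch splits after the second range. Each resulting group would then correspond to one subcompaction.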
krad
1126644082 Relaxing consistency detection to include errors while inserting to memtable as WAL recovery error.
Summary: The current code considers data to be consistent if the record
checksum passes. We do have customer issues where the record checksum passed but
the data was incomprehensible. There is no way to get out of this error case
since all WAL recovery modes will consider this error as unrelated to WAL.

Relaxing the definition and including errors while inserting to memtable as WAL
errors and handling them as per the recovery level.

Test Plan: Used customer dump to verify the fix for different levels. The db
opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for
kAbsoluteConsistency and kTolerateCorruptedTailRecords.

Reviewers: sdon igor

CC: leveldb@

Task ID: #7918721

Blame Rev:
2015-09-10 12:56:17 -07:00
sdong
abc7f5fdb2 Make DBTest.ReadLatencyHistogramByLevel more robust
Summary: DBTest.ReadLatencyHistogramByLevel did not behave as intended: after the writes, reads weren't guaranteed to hit the data written. Fix it.

Test Plan: Run the test multiple times

Reviewers: IslamAbdelRahman, rven, anthony, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46587
2015-09-10 11:32:19 -07:00
Igor Canadi
ac9bcb55ce Set max_open_files based on ulimit
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566

Test Plan:
make check

I will also verify that max_max_open_files is reasonable.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46551
2015-09-10 10:49:28 -07:00
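As a rough illustration of the idea in the commit above (ac9bcb55ce), the sketch below reads the soft limit on open file descriptors with POSIX `getrlimit()` and caps a requested value at it. The helper names `SoftOpenFileLimit` and `ClampOpenFiles` are hypothetical, and the real change may clip to a different range; this only shows the "never exceed ulimit" rule.

```cpp
#include <limits>
#include <sys/resource.h>

// Returns the soft limit on open file descriptors, or -1 if it is unknown,
// unlimited, or too large to represent as a long.
long SoftOpenFileLimit() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0) return -1;
  if (limit.rlim_cur == RLIM_INFINITY) return -1;
  if (limit.rlim_cur >
      static_cast<rlim_t>(std::numeric_limits<long>::max())) {
    return -1;
  }
  return static_cast<long>(limit.rlim_cur);
}

// Hypothetical helper: caps `requested` at `soft_limit`. A soft_limit of -1
// means "no known limit" and leaves the request unchanged.
long ClampOpenFiles(long requested, long soft_limit) {
  if (soft_limit < 0) return requested;
  return requested > soft_limit ? soft_limit : requested;
}
```

A caller would combine the two, for example `ClampOpenFiles(5000, SoftOpenFileLimit())`, so that a request for 5000 files on a system with a limit of 1024 is reduced to 1024.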
agiardullo
b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". This makes result.arena_block_size not a multiple of 4KB, even if options.write_buffer_size is a multiple of MBs. When calling malloc with arena_block_size, we may waste a small amount of memory. We now make the default /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
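The arithmetic behind the arena block size commit above (b5b2b75e52) can be sketched as follows. The helper name `DefaultArenaBlockSize` is hypothetical, and the sketch uses only the /8 divisor mentioned in the summary; it is an illustration of the rounding, not the actual option-sanitizing code.

```cpp
#include <cstddef>

constexpr std::size_t kArenaAlign = 4 * 1024;  // 4 KB

// Hypothetical helper: one eighth of the write buffer, rounded up to the
// next multiple of 4 KB.
constexpr std::size_t DefaultArenaBlockSize(std::size_t write_buffer_size) {
  return ((write_buffer_size / 8 + kArenaAlign - 1) / kArenaAlign) *
         kArenaAlign;
}
```

For a 4 MB write buffer, the old rule gives 4194304 / 10 = 419430 bytes, which is not a multiple of 4096, while the sketch gives 524288 bytes (exactly 128 blocks of 4 KB).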
sdong
342ba80895 Make DBTest.OptimizeFiltersForHits more deterministic
Summary:
This commit makes DBTest.OptimizeFiltersForHits more deterministic by:
(1) make key inserts more random
(2) make sure L0 has one file
(3) make file size smaller compared to level target so L1 will cover more range.

Test Plan: Run the test many times.

Reviewers: rven, IslamAbdelRahman, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46461
2015-09-08 19:31:34 -07:00
Andres Notzli
e17e92ea19 Relaxed assert in forward iterator
Summary:
It looks like in some cases an assert in SeekInternal failed when computing the
hints for the next level because user_key was the same as the largest key and
not strictly smaller. Relaxing the assert to expect smaller or equal keys.

Test Plan: make clean all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46443
2015-09-08 17:15:11 -07:00
Andres Noetzli
6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
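The shape of the Equal() addition described above (6bdc484fd8) can be sketched with a simplified interface. The class names here are hypothetical and `std::string` stands in for RocksDB's key type; the point is only that Equal() has a default built on Compare() and that a subclass may override it with a cheaper check.

```cpp
#include <string>

// Simplified, hypothetical comparator interface (not the RocksDB class).
class KeyComparator {
 public:
  virtual ~KeyComparator() = default;

  // Three-way comparison: negative, zero, or positive.
  virtual int Compare(const std::string& a, const std::string& b) const = 0;

  // Default equality falls back to the three-way comparison.
  virtual bool Equal(const std::string& a, const std::string& b) const {
    return Compare(a, b) == 0;
  }
};

// Overrides Equal() with a direct equality test.
class BytewiseOrder : public KeyComparator {
 public:
  int Compare(const std::string& a, const std::string& b) const override {
    return a.compare(b);
  }
  bool Equal(const std::string& a, const std::string& b) const override {
    return a == b;
  }
};

// Relies on the default Equal() inherited from the base class.
class ReverseOrder : public KeyComparator {
 public:
  int Compare(const std::string& a, const std::string& b) const override {
    return b.compare(a);
  }
};
```

Callers that only care about equality can then call Equal() and get whichever implementation the comparator provides.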
Andres Noetzli
3a0df7f161 Fixed comparison in ForwardIterator when computing hint for GetNextLevelIndex()
Summary: When computing the hint for GetNextLevelIndex(), ForwardIterator was doing a redundant comparison. This patch fixes the comparison (using https://github.com/facebook/rocksdb/blob/master/db/version_set.cc#L158 as a reference) and moves it inside an assert because we expect `level_files[f_idx]` to contain the next key after Seek(), so user_key should always be smaller than the largest key.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: tnovak, sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46227
2015-09-08 09:47:54 -07:00
Venkatesh Radhakrishnan
91f3c90792 Fix case when forward iterator misses a new update
Summary:
This diff fixes a case when the forward iterator misses a new
insert when the mutable iterator is not current. The test is also
improved and the check for deleted iterators is made more informative.

Test Plan: DBTailingIteratorTest.*Trim

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46167
2015-09-04 14:28:45 -07:00
Andres Noetzli
3c9cef1eed Unified maps with Comparator for sorting, other cleanup
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).

Test Plan: make clean check all

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45993
2015-09-02 13:58:22 -07:00
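One way to define a key-value map that sorts with a comparator, as the commit above (3c9cef1eed) describes, is to adapt the three-way comparator into the strict-weak-ordering functor that `std::map` expects. The sketch below is an assumption about the general approach; the names `KeyComparator`, `LessOfComparator`, and `KVMap` are used for illustration and `std::string` stands in for the key type.

```cpp
#include <map>
#include <string>

// Simplified, hypothetical three-way comparator interface.
struct KeyComparator {
  virtual ~KeyComparator() = default;
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

// Sorts keys in reverse bytewise order.
struct ReverseOrder : KeyComparator {
  int Compare(const std::string& a, const std::string& b) const override {
    return b.compare(a);
  }
};

// Adapts a three-way comparator to the "less than" functor std::map needs.
struct LessOfComparator {
  explicit LessOfComparator(const KeyComparator* c) : cmp(c) {}
  bool operator()(const std::string& a, const std::string& b) const {
    return cmp->Compare(a, b) < 0;
  }
  const KeyComparator* cmp;  // not owned; must outlive the map
};

// A single map type that every user of comparator-sorted maps can share.
using KVMap = std::map<std::string, std::string, LessOfComparator>;
```

A map built as `KVMap kvmap(LessOfComparator(&cmp))` from a named `ReverseOrder cmp` iterates from the largest key to the smallest.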
sdong
3e0a672c50 Bug fix: table readers created by TableCache::Get() doesn't have latency histogram reported
Summary: TableCache::Get() puts parameters in the wrong places so that table readers created by Get() will not have the histogram updated.

Test Plan: Will write a unit test for that.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46035
2015-09-02 12:57:07 -07:00
Tomislav Novak
5508122ed6 Fix a perf regression in ForwardIterator
Summary:
I noticed that memtable iterator usually crosses the `iterate_upper_bound`
threshold when tailing. Changes introduced in D43833 made `NeedToSeekImmutable`
always return true in such case, even when `Seek()` only needs to rewind the
memtable iterator. In a test I ran, this caused the "tailing efficiency"
(ratio of calls to `Seek()` that only affect the memtable versus all seeks)
to drop almost to zero.

This diff attempts to fix the regression by using a different flag to indicate
that `current_` is over the limit instead of resetting `valid_` in
`UpdateCurrent()`.

Test Plan: `DBTestTailingIterator.TailingIteratorUpperBound`

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba, march

Differential Revision: https://reviews.facebook.net/D45909
2015-09-01 09:54:30 -07:00
Andres Notzli
b722007778 Fix listener_test when using ROCKSDB_MALLOC_USABLE_SIZE
Summary:
Flushes in listener_test happened too early when ROCKSDB_MALLOC_USABLE_SIZE was
active (e.g. when compiling with ROCKSDB_FBCODE_BUILD_WITH_481=1) due to
malloc_usable_size() reporting a better estimate (similar to
https://reviews.facebook.net/D43317 ). This patch grows the write buffer size
slightly to compensate for this.

Test Plan: ROCKSDB_FBCODE_BUILD_WITH_481=1 make listener_test && ./listener_test

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45921
2015-08-31 23:11:12 -07:00
agiardullo
18db1e4695 better db_bench options for transactions
Summary:
Pessimistic Transaction expiration time checking currently causes a performance regression. Let's disable it in db_bench by default.

Also, in order to be able to better tune how much contention we're simulating, added new options to set lock timeout and snapshot.

Test Plan: run db_bench randomtransaction

Reviewers: sdong, igor, yhchiang, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45831
2015-08-31 15:56:07 -07:00
Ari Ekmekji
8b689546b6 Add Subcompactions to Universal Compaction Unit Tests
Summary:
Now that the approach to parallelizing L0-L1 level-based
compactions by breaking the compaction job into subcompactions is
being extended to apply to universal compactions as well, the unit
tests need to account for this and run the universal compaction
tests with subcompactions both enabled and disabled.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45657
2015-08-31 12:59:02 -07:00
sdong
3d78eb66bb Arena usage to be calculated using malloc_usable_size()
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.

Test Plan: Run all unit tests

Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43317
2015-08-31 09:39:27 -07:00
Andres Noetzli
effd9dd1e1 Fix deadlock in WAL sync
Summary:
MarkLogsSynced() was doing `logs_.erase(it++);`. The standard says:

```
all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated)
```

Because `it` is an iterator to the first element of the container, it is
invalidated, only one iteration is executed and `log.getting_synced = false;`
is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()`
is not terminating.

Test Plan: make db_bench && ./db_bench --benchmarks=fillsync

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak

Reviewed By: tnovak

Subscribers: kolmike, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45807
2015-08-28 18:06:32 -07:00
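The erase pattern discussed in the commit above (effd9dd1e1) can be shown on a plain `std::deque`. The sketch below is loosely modeled on the loop described in the summary; the `LogEntry` struct, the `MarkSynced` function, and the "keep at least one log" rule are simplifying assumptions, not the actual MarkLogsSynced() code. The key point is that the loop continues from the iterator returned by `erase()` instead of an iterator that was incremented before the erase.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for one entry of the log list.
struct LogEntry {
  std::uint64_t number;
  bool getting_synced;
};

// Drops synced logs up to `up_to` while more than one log remains, and
// clears the getting_synced flag on a log that is kept.
void MarkSynced(std::deque<LogEntry>& logs, std::uint64_t up_to) {
  for (auto it = logs.begin(); it != logs.end() && it->number <= up_to;) {
    if (logs.size() > 1) {
      // erase() returns a valid iterator to the element after the erased
      // one, so the loop never touches an invalidated iterator.
      it = logs.erase(it);
    } else {
      it->getting_synced = false;
      ++it;
    }
  }
}
```

With three logs numbered 1 to 3 and `up_to` equal to 3, the first two are erased and the last one is kept with its flag cleared, so a loop waiting on `logs.front().getting_synced` can terminate.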
Andres Noetzli
72a9b73c9e Removed unnecessary checks in DBTest.ApproximateMemoryUsage
Summary:
Just realized that after D45675, part of the code in
DBTest.ApproximateMemoryUsage, does not really test anything anymore, so I
removed it.

Test Plan: make clean all check

Reviewers: rven, igor, sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45783
2015-08-28 11:13:20 -07:00
Venkatesh Radhakrishnan
cb164bfc48 Do not delete iterators for immutable memtables.
Summary:
The immutable memtable iterators are allocated from an arena and there
is no benefit from deleting these. Also the immutable memtables
themselves will continue to be in memory until the version set
containing it is alive. We will not remove immutable memtable iterators
over the upper bound. We now add immutable iterators to the test.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45597
2015-08-28 11:07:07 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Andres Noetzli
e853191c17 Fix DBTest.ApproximateMemoryUsage
Summary:
This patch fixes two issues in DBTest.ApproximateMemoryUsage:
- It was possible that a flush happened between getting the two properties in
  Phase 1, resulting in different numbers for the properties and failing the
  assertion. This is fixed by waiting for the flush to finish before getting
  the properties.
- There was a similar issue in Phase 2 and additionally there was an issue that
  rocksdb.size-all-mem-tables was not monotonically increasing because it was
  possible that a flush happened just after getting the properties and then
  another flush just before getting the properties in the next round. In this
  situation, the reported memory usage decreased. This is fixed by forcing a
  flush before getting the properties.

Note: during testing, I found that kFlushesPerRound does not seem very
accurate. I added a TODO for this and it would be great to get some input on
what to do there.

Test Plan:
The first issue can be made more likely to trigger by inserting a
`usleep(10000);` between the calls to GetIntProperty() in Phase 1.
The second issue can be made more likely to trigger by inserting a
`if (r != 0) usleep(10000);` before the calls to GetIntProperty() and a
`usleep(10000);` after the calls.
Then execute make db_test && ./db_test --gtest_filter=DBTest.ApproximateMemoryUsage

Reviewers: rven, yhchiang, igor, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45675
2015-08-27 16:17:08 -07:00
Yueh-Hsuan Chiang
8ef0144e2f Add argument --show_table_properties to db_bench
Summary:
Add argument --show_table_properties to db_bench

  -show_table_properties (If true, then per-level table properties will be
    printed on every stats-interval when stats_interval is set and
    stats_per_interval is on.) type: bool default: false

Test Plan:
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1 --num_column_families=2

Sample Output:

    Compaction Stats [column_family_name_000001]
    Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)  KeyIn KeyDrop
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0      3/0          5   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     86.3         0        17    0.021          0       0      0
      L1      5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
      L2      9/0         16   0.2      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
     Sum     17/0         31   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     86.3         0        17    0.021          0       0      0
     Int      0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     83.9         0         2    0.022          0       0      0
    Flush(GB): cumulative 0.030, interval 0.004
    Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

    Level[0]: # data blocks=2571; # entries=84813; raw key size=2035512; raw average key size=24.000000; raw value size=8481300; raw average value size=100.000000; data block size=5690119; index block size=82415; filter block size=0; (estimated) table size=5772534; filter policy name=N/A;
    Level[1]: # data blocks=4285; # entries=141355; raw key size=3392520; raw average key size=24.000000; raw value size=14135500; raw average value size=100.000000; data block size=9487353; index block size=137377; filter block size=0; (estimated) table size=9624730; filter policy name=N/A;
    Level[2]: # data blocks=7713; # entries=254439; raw key size=6106536; raw average key size=24.000000; raw value size=25443900; raw average value size=100.000000; data block size=17077893; index block size=247269; filter block size=0; (estimated) table size=17325162; filter policy name=N/A;
    Level[3]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[4]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[5]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[6]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;

Reviewers: anthony, IslamAbdelRahman, MarkCallaghan, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45651
2015-08-26 18:27:23 -07:00
Igor Canadi
5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add ReadaheadRandomAccessFile layer only when file is read during compactions.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
sdong
d286b5df90 DBIter to filter out extra keys with higher sequence numbers when changing direction from forward to backward
Summary:
When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test.

This may not be a perfect fix, but it has the least impact on existing code, in order to be safe.

Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.

Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45567
2015-08-26 13:01:39 -07:00
Andres Noetzli
3795449c9d Fix DBTest.GetProperty
Summary:
DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
that the test closed the database before the compaction was done. When the test
reopened the database, RocksDB would schedule a compaction which in turn
created table readers and led the test to fail the assertion that
rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
rocksdb.estimate-table-readers-mem happened before the compaction created the
table readers, hiding the problem. This patch changes the
WaitForFlushMemTable() to WaitForCompact(). WaitForFlushMemTable() is not
necessary because it is already being called a couple of lines before without
any insertions in-between.

Test Plan:
Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done

Reviewers: rven, yhchiang, anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45603
2015-08-26 10:10:26 -07:00
Igor Canadi
a7834a1292 Merge pull request #698 from yuslepukhin/address_noexcept_windows
Address noexcept and const integer lambda capture on win
2015-08-25 17:15:23 -07:00
Dmitri Smirnov
6924d7582b Address noexcept and const integer lambda capture
VS 2013 does not support noexcept.
   Complains about usage of integer constant within lambda requiring explicit capture.
2015-08-25 15:17:14 -07:00
Ari Ekmekji
2f8d71ec05 Moving sequence number compaction variables from SubCompactionState to CompactionJob
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can stay as global variables
instead to make things simpler.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45477
2015-08-25 14:03:10 -07:00
Venkatesh Radhakrishnan
bab9934d9e Fix build failure caused by bad merge.
Summary: There was a bad merge during refresh.

Test Plan: make -j all; make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45555
2015-08-25 14:02:03 -07:00
Venkatesh Radhakrishnan
4d28a7d8ab Add a whitebox test for deleted file iterators.
Summary:
We have earlier added a feature to delete file iterators when the
current key is over the iterate upper bound. We now add a whitebox test
to check if the file iterators were actually deleted.

Test Plan: Add check for a range which has deleted iterators.

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45321
2015-08-25 13:40:58 -07:00
Venkatesh Radhakrishnan
249fb4f881 Fix use of deleted file iterators with incomplete iterators
Summary:
After deleting file iterators which are over the iterate upper
bound, we also need to check for null pointers in
ResetIncompleteIterators.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45525
2015-08-25 13:38:35 -07:00
Andres Notzli
09d982f9e0 Fix compact_files_example
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.

Test Plan: make clean && cd examples && make all && ./compact_files_example

Reviewers: rven, sdong, anthony, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45117
2015-08-25 12:29:44 -07:00
Yueh-Hsuan Chiang
6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggregated table properties of a column family,
while the latter returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
Andres Noetzli
2050832974 Fixing race condition in DBTest.DynamicMemtableOptions
Summary:
This patch fixes a race condition in DBTest.DynamicMemtableOptions. In rare cases,
it was possible that the main thread would fill up both memtables before the flush
job acquired its work. Then, the flush job was flushing both memtables together,
producing only one L0 file while the test expected two. Now, the test waits for
flushes to finish earlier, to make sure that the memtables are flushed in separate
flush jobs.

Test Plan:
Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush()
to make the issue more likely. Then test with:
make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45429
2015-08-24 17:04:18 -07:00
Igor Canadi
e46bcc08b9 Remove an extra 's' from cur-size-all-mem-tabless
Summary: As title

Test Plan: make check

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45447
2015-08-24 16:43:18 -07:00
Igor Canadi
4ab26c5ad1 Smarter purging during flush
Summary:
Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks.

This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point.

I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature.

Test Plan:
make check
I had to adjust some unit tests to understand this new behavior

Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli

Reviewed By: noetzli

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42087
2015-08-24 11:11:12 -07:00
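The snapshot-aware purging described in the commit above (4ab26c5ad1) rests on one rule: among several versions of the same key, only the newest version visible to each snapshot (and the newest overall) needs to be kept. The sketch below illustrates that rule on bare sequence numbers. It is a simplified assumption, not the flush or compaction_job.cc logic: the function names are hypothetical, and deletions and merge operands are ignored.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

constexpr std::uint64_t kNoSnapshot =
    std::numeric_limits<std::uint64_t>::max();

// Returns the earliest snapshot that can see sequence number `seq`, i.e. the
// smallest snapshot >= seq, or kNoSnapshot if only the latest view sees it.
// `snapshots` must be sorted in ascending order.
std::uint64_t EarliestVisibleSnapshot(
    std::uint64_t seq, const std::vector<std::uint64_t>& snapshots) {
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), seq);
  return it == snapshots.end() ? kNoSnapshot : *it;
}

// Given the sequence numbers of all versions of one key, newest first,
// returns the versions that must be kept.
std::vector<std::uint64_t> KeepVersions(
    const std::vector<std::uint64_t>& seqs_newest_first,
    const std::vector<std::uint64_t>& snapshots) {
  std::vector<std::uint64_t> kept;
  bool have_previous = false;
  std::uint64_t previous_snapshot = 0;
  for (std::uint64_t seq : seqs_newest_first) {
    const std::uint64_t snapshot = EarliestVisibleSnapshot(seq, snapshots);
    // A newer version already covers this snapshot, so this one is hidden.
    if (have_previous && snapshot == previous_snapshot) continue;
    kept.push_back(seq);
    previous_snapshot = snapshot;
    have_previous = true;
  }
  return kept;
}
```

For versions at sequence numbers 10, 8, 5, and 3 with snapshots at 4 and 9, the sketch keeps 10, 8, and 3 and drops 5, because snapshot 9 already sees the newer version 8. With no snapshots at all, only version 10 survives.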
Ari Ekmekji
b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all && make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
sdong
c852968465 db_iter_test: add more test cases for the data race bug
Summary: Add more test cases of data race causing wrong iterating results. Tag tests not passing as DISABLED_

Test Plan: Run the tests

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44907
2015-08-21 12:14:12 -07:00
sdong
9130873a13 Add options.new_table_reader_for_compaction_inputs
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. This makes fadvise behave less predictably. Add options.new_table_reader_for_compaction_inputs to force the creation of a new file descriptor and a new table reader for compaction inputs.

Test Plan: Add the option.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: igor

Subscribers: igor, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43311
2015-08-21 08:46:29 -07:00
sdong
07d2d34160 Add a counter about estimated pending compaction bytes
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use a threshold on this counter to replace the soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about, when deciding when writes should slow down or stop, than the more abstract soft and hard rate limits.

Test Plan: Add unit tests

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44205
2015-08-20 22:17:10 -07:00
Yueh-Hsuan Chiang
a203b913c1 Fixed a rare deadlock in DBTest.ThreadStatusFlush
Summary:
Currently, ThreadStatusFlush uses two sync-points to ensure
there's a flush currently running when calling GetThreadList().
However, one of the sync-points is inside db-mutex, which could
cause deadlock in case there's a DB::Get() call.

This patch fixes this issue by moving the sync-point to a better
place where the flush job does not hold the mutex.

Test Plan: db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45045
2015-08-20 17:18:47 -07:00
Siying Dong
962aa64292 Merge pull request #695 from yuslepukhin/address_windows_build
Address windows build issues caused by introducing Subcompaction
2015-08-20 17:04:48 -07:00
Dmitri Smirnov
5bf8907622 More indent adjustment.
2015-08-20 14:14:02 -07:00
Dmitri Smirnov
e2a9f43d64 Adjust indent
2015-08-20 14:10:51 -07:00
Dmitri Smirnov
1cac89c9b1 Address windows build issues
Intro SubCompactionState move functionality
 =delete copy functionality
 #ifdef SyncPoint in tests for Windows Release builds
2015-08-20 14:08:24 -07:00
Islam AbdelRahman
027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculates the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Andres Noetzli
b604d2562f Removing unused variables to fix build
Summary: Removing two unused variables that prevented compilation.

Test Plan: make all

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44991
2015-08-19 16:57:40 -07:00
Venkatesh Radhakrishnan
1b114eed4d Free file iterators for files which are above the iterate upper bound to Improve memory utilization
Summary:
This diff improves the memory utilization for tailing iterators in RocksDB
by freeing file iterators which are over the upper bound.
It is an update to Siying's original diff for improving the memory usage for
tailing iterators. The changes for the seek and next path are now complete
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: march, tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43833
2015-08-19 16:05:51 -07:00
Islam AbdelRahman
3fd70b05b8 Rate limit deletes issued by DestroyDB
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately

Test Plan: added a unittest

Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: jeanxu2012, dhruba

Differential Revision: https://reviews.facebook.net/D44955
2015-08-19 15:02:17 -07:00
Yueh-Hsuan Chiang
df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, including those
that have been flushed but are pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
sdong
888fbdc889 Remove the constraint that iterator upper bound needs to be within a prefix
Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one.
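
A minimal sketch of computing "the next prefix of the current one", assuming keys are ordered bytewise. `NextPrefix` is a hypothetical helper written for this illustration, not a RocksDB function:

```cpp
#include <cassert>
#include <string>

// Returns the smallest string that sorts after every string starting with
// `prefix` under bytewise ordering. Returns an empty string when no such
// bound exists (the prefix is empty or consists only of 0xFF bytes).
std::string NextPrefix(std::string prefix) {
  while (!prefix.empty()) {
    unsigned char last = static_cast<unsigned char>(prefix.back());
    if (last != 0xFF) {
      prefix.back() = static_cast<char>(last + 1);
      return prefix;
    }
    // 0xFF cannot be incremented; drop it and carry into the previous byte.
    prefix.pop_back();
  }
  return prefix;  // empty: there is no finite upper bound
}
```

With RocksDB itself, a bound like this would typically be wrapped in a Slice and supplied through ReadOptions::iterate_upper_bound.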

Test Plan: make commit-prereq

Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven

Reviewed By: rven

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44949
2015-08-19 11:03:51 -07:00
Ari Ekmekji
137c376675 Removing variables used only in assertions to prevent build error
Summary:
A couple variables were declared but only used in assertions
which causes issues when building in fbcode.

Test Plan: make dbg  and   make release

Reviewers: yhchiang, sdong, igor, anthony, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44937
2015-08-19 08:52:22 -07:00
Ari Ekmekji
b47cc58516 Bounding Number of Subcompactions
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large
this causes issues, particularly with the overlap between files since
very small output files can be generated. This diff bounds the number
of subcompactions to the user option DBOption.num_subcompactions.

Test Plan: ./db_test ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44883
2015-08-18 14:56:31 -07:00
Venkatesh Radhakrishnan
e58e1b18e7 Make tailing iterator show new entries in memtable.
Summary:
Reseek mutable_iter if it is invalid in Next and immutable_iter
is invalid.

Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext

Reviewers: tnovak, march, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44865
2015-08-18 14:40:06 -07:00
Ari Ekmekji
601b1aaca0 Fixing Failed Assertion in Subcompaction State Diff
Summary:
In D43239 (https://reviews.facebook.net/D43239) there is an
assertion to make sure a subcompaction's output is never empty at the
end of execution. This assertion however breaks the build because some
tests lead to exactly that scenario. So instead I have altered the logic
to handle this case instead of just failing the assertion.

The reason that it is possible for a subcompaction's output to be empty is
that during a sequential execution of subcompactions, if a user aborts the
compaction job then some of the later subcompactions to be executed may
have yet to process any keys and therefore have yet to generate output files.
This becomes very rare once the subcompactions are executed in parallel,
but for now they are still sequential so the case is possible when there is an
early termination, as in some of the tests.

Test Plan: ./db_test  ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44877
2015-08-18 12:27:12 -07:00
Ari Ekmekji
f0da6977a3 [Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State
Summary:
In preparation for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.
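
The per-subcompaction state and the final aggregation can be sketched roughly as follows. This is a simplified model, not the actual CompactionJob code: `ToySubcompactionState`, `RunSubcompaction`, and `RunAll` are hypothetical names, the "keep even keys" rule merely stands in for real compaction work, and the subcompactions run sequentially here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the state owned by one subcompaction.
struct ToySubcompactionState {
  std::vector<int> input_keys;      // keys this subcompaction is responsible for
  uint64_t num_input_records = 0;
  uint64_t num_output_records = 0;
};

struct ToyAggregatedStats {
  uint64_t num_input_records = 0;
  uint64_t num_output_records = 0;
};

// Each subcompaction updates only its own state, never a shared one.
void RunSubcompaction(ToySubcompactionState* state) {
  for (int key : state->input_keys) {
    ++state->num_input_records;
    if (key % 2 == 0) {
      ++state->num_output_records;  // toy rule: only even keys survive
    }
  }
}

// Run every subcompaction, then fold the per-subcompaction statistics into
// one result, as if a single larger compaction had run.
ToyAggregatedStats RunAll(std::vector<ToySubcompactionState>& states) {
  for (ToySubcompactionState& state : states) {
    RunSubcompaction(&state);
  }
  ToyAggregatedStats total;
  for (const ToySubcompactionState& state : states) {
    total.num_input_records += state.num_input_records;
    total.num_output_records += state.num_output_records;
  }
  return total;
}
```

Because no state is shared while the subcompactions run, the loop in `RunAll` could later be replaced by one thread per state without adding locks around the counters.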

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_test

Reviewers: sdong, anthony, igor, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43239
2015-08-18 11:06:23 -07:00
Andres Notzli
f32a572099 Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.

Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentally producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.

Test Plan: make clean all check

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43353
2015-08-17 17:34:38 -07:00
sdong
72613657f0 Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.

Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected

Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44193
2015-08-14 17:32:42 -07:00
Nathan Bronson
b7198c3afe reduce db mutex contention for write batch groups
Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.

For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment.  With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win.  Not all scenarios see such a
win, but none show a loss.  This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).

Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Reviewers: sdong, igor, rven, ljin, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43887
2015-08-14 10:55:43 -07:00
sdong
603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00
sdong
4637207120 Add test case to repro the mispositional iterator in a low-chance data race case
Summary: Iterator has a bug: if a child iterator reaches its end, and the user issues a Prev(), and just before SeekToLast() of the child iterator is called, some extra rows are added at the end, the position of the iterator can be misplaced.

Test Plan: Run the tests with or without valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43671
2015-08-12 10:50:52 -07:00
agiardullo
0db807ec28 Transaction error statuses
Summary:
Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures.

https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954

Test Plan: unit tests

Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43323
2015-08-11 17:52:56 -07:00
agiardullo
c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
Islam AbdelRahman
c2868cbc52 Use manual_compaction for compaction_job_test
Summary:
Under certain conditions (disable compression) the compactions that are created in compaction_job_test will pass the trivial_move conditions.
This will cause problems since we assert that we don't run a compaction if it's a trivial move
https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L144-L147

for example when we disable compression, compactions become a valid trivial move and the assert fails
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/180/console

Test Plan: compaction_job_test

Reviewers: sdong, yhchiang, noetzli, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43983
2015-08-11 14:47:14 -07:00
Islam AbdelRahman
cee1e8a080 Parallelize LoadTableHandlers
Summary: Add a new option that allows LoadTableHandlers to use multiple threads to load files on DB Open and Recover

Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43755
2015-08-11 12:19:56 -07:00
Andres Notzli
4249f159d5 Removing duplicate code in db_bench/db_stress, fixing typos
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
compression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.

Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress

Reviewers: yhchiang, sdong, rven, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43965
2015-08-11 11:46:15 -07:00
Nathan Bronson
1ae27113c7 reduce comparisons by skiplist
Summary:
Key comparison is the single largest CPU user for CPU-bound
workloads. This diff reduces the number of comparisons in two ways.

The first is that it moves predecessor array gathering from
FindGreaterOrEqual to FindLessThan, so that FindGreaterOrEqual can
return immediately if compare_ returns 0.  As part of this change I
moved the sequential insertion optimization into Insert, to remove the
undocumented (and smelly) requirement that prev must be equal to prev_
if it is non-null.

The second optimization is that all of the search functions skip calling
compare_ when moving to a lower level that has the same Next pointer.
With a branching factor of 4 we would expect this to happen 1/4 of
the time.
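
Both optimizations can be illustrated on a toy skip list with integer keys. This is a sketch under stated assumptions, not the RocksDB skip list: `ToySkipList`, `Append`, and the `skip_repeated` switch are hypothetical, node heights are chosen by the caller, and keys must be appended in ascending order.

```cpp
#include <cassert>
#include <vector>

struct Node {
  int key;
  std::vector<Node*> next;  // next[i] is the successor at level i
  Node(int k, int height) : key(k), next(height, nullptr) {}
};

class ToySkipList {
 public:
  explicit ToySkipList(int max_height)
      : head_(0, max_height), tails_(max_height, &head_) {}
  ~ToySkipList() { for (Node* n : owned_) delete n; }
  ToySkipList(const ToySkipList&) = delete;
  ToySkipList& operator=(const ToySkipList&) = delete;

  // Appends a node with a caller-chosen height; keys must be ascending.
  void Append(int key, int height) {
    assert(height >= 1 && height <= static_cast<int>(head_.next.size()));
    Node* n = new Node(key, height);
    owned_.push_back(n);
    for (int i = 0; i < height; ++i) {
      tails_[i]->next[i] = n;
      tails_[i] = n;
    }
  }

  // Returns the first node with key >= target, or nullptr if none exists.
  // *comparisons is incremented once per key comparison performed.
  const Node* FindGreaterOrEqual(int target, bool skip_repeated,
                                 int* comparisons) const {
    const Node* x = &head_;
    int level = static_cast<int>(head_.next.size()) - 1;
    const Node* last_bigger = nullptr;
    while (true) {
      const Node* next = x->next[level];
      int cmp;
      if (next == nullptr || (skip_repeated && next == last_bigger)) {
        cmp = 1;  // already known to be past the target: no comparison needed
      } else {
        ++*comparisons;
        cmp = next->key < target ? -1 : (next->key > target ? 1 : 0);
      }
      if (cmp == 0 || (cmp > 0 && level == 0)) {
        return next;  // an exact match returns immediately at any level
      }
      if (cmp < 0) {
        x = next;            // keep moving right on this level
      } else {
        last_bigger = next;  // remember the node and drop one level
        --level;
      }
    }
  }

 private:
  Node head_;
  std::vector<Node*> tails_;  // last node on each level, used by Append
  std::vector<Node*> owned_;
};
```

Calling `FindGreaterOrEqual` with `skip_repeated` set to false and then true on the same list gives a way to count how many comparisons the second optimization saves for a given key.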

On a single-threaded CPU-bound workload (-benchmarks=fillrandom -threads=1
-batch_size=1 -memtablerep=skip_list -value_size=0 --num=1600000
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000)
on my dev server this is good for a 7% perf win.

Test Plan: unit tests

Reviewers: rven, ljin, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43233
2015-08-11 11:25:22 -07:00
Islam AbdelRahman
a9dcc0a638 Fix clang build
Summary:
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/893/console
Fixing clang build

Test Plan:
make clean
USE_CLANG=1 make all -j64

Reviewers: sdong, noetzli, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43959
2015-08-10 11:30:36 -07:00
Andres Notzli
68f934355a Better CompactionJob testing
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.

Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check

Reviewers: sdong, rven, igor, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42837
2015-08-07 21:59:51 -07:00
agiardullo
16ea1c7d1c simple ManagedSnapshot wrapper
Summary: Implemented this simple wrapper for something else I was working on.  Seemed like it makes sense to expose it instead of burying it in some random code.
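
The idea behind such a wrapper is plain RAII: acquire the snapshot in the constructor and release it in the destructor. A minimal sketch, assuming a toy database type; `ToyDB`, `ToySnapshot`, and `ScopedSnapshot` are hypothetical stand-ins and not the RocksDB classes:

```cpp
#include <cassert>

// Hypothetical stand-ins for the database and its snapshot handle.
struct ToySnapshot {
  int sequence;
};

class ToyDB {
 public:
  const ToySnapshot* GetSnapshot() {
    ++live_snapshots_;
    return new ToySnapshot{++sequence_};
  }
  void ReleaseSnapshot(const ToySnapshot* snapshot) {
    --live_snapshots_;
    delete snapshot;
  }
  int live_snapshots() const { return live_snapshots_; }

 private:
  int sequence_ = 0;
  int live_snapshots_ = 0;
};

// Takes a snapshot on construction and releases it on destruction, so the
// snapshot cannot leak on early returns or exceptions.
class ScopedSnapshot {
 public:
  explicit ScopedSnapshot(ToyDB* db) : db_(db), snapshot_(db->GetSnapshot()) {}
  ~ScopedSnapshot() { db_->ReleaseSnapshot(snapshot_); }

  // Copying would release the same snapshot twice, so it is disabled.
  ScopedSnapshot(const ScopedSnapshot&) = delete;
  ScopedSnapshot& operator=(const ScopedSnapshot&) = delete;

  const ToySnapshot* snapshot() const { return snapshot_; }

 private:
  ToyDB* db_;
  const ToySnapshot* snapshot_;
};
```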

Test Plan: added test

Reviewers: rven, kradhakrishnan, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43293
2015-08-06 17:59:05 -07:00
sdong
6a4aaadcd7 Avoid type unique_ptr in LogWriterNumber::writer for Windows build break
Summary:
Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do an explicit free.
It is less safe but I can't think of a better way to unblock it.

Test Plan: valgrind check test

Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43647
2015-08-06 10:52:41 -07:00
Andres Noetzli
d7314ba759 Fixing endless loop if seeking to end of key with seq num 0
Summary:
When seeking to the last occurrence of a key with sequence number 0, db_iter
ends up in an endless loop because it seeks to type kValueTypeForSeek
which is larger than kTypeDeletion/kTypeValue. Added test case that triggers
the behavior.

Test Plan: make clean all check

Reviewers: igor, rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43653
2015-08-06 10:43:28 -07:00
Islam AbdelRahman
29b028b0ed Make DeleteScheduler tests more reliable
Summary: Update DeleteScheduler tests so that they verify the used penalties for waiting instead of measuring the time spent which is not reliable

Test Plan:
make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_TSAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_ASAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test

make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_ASAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43635
2015-08-05 19:16:52 -07:00
Poornima Chozhiyath Raman
7d364d0d94 Fix build failure
Summary: fix the build failure

Test Plan: make all

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43623
2015-08-05 16:38:12 -07:00
Poornima Chozhiyath Raman
960d936e83 Add function 'GetInfoLogList()'
Summary: The list of info log files of a db can be obtained using the new function.

Test Plan: New test in db_test.cc passed.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41715
2015-08-05 16:16:46 -07:00
sdong
7ccd1c80a7 Add two unit tests for SyncWAL()
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait for ongoing writes to finish before being executed.

Create a new test file db_wal_test and move two WAL related tests from db_test to here.

Test Plan: Run the new tests

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43605
2015-08-05 14:27:02 -07:00
sdong
3ae386eafe Add statistic histogram "rocksdb.sst.read.micros"
Summary: Measure read latency histogram and put it in statistics. Compaction inputs are excluded from it when possible (unfortunately this is usually not possible, as we usually take the table reader from the table cache).

Test Plan:
Run db_bench and it shows the stats, like:

rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43275
2015-08-05 13:02:33 -07:00
Islam AbdelRahman
9aec75fbb9 Enable DBTest.FlushSchedule under TSAN
Summary: This patch will fix the false positive of DBTest.FlushSchedule under TSAN, so we don't need to disable this test

Test Plan: COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.FlushSchedule"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43599
2015-08-05 11:47:07 -07:00
sdong
8e01bd1144 Fix misplaced position for reversing iterator direction while current key is a merge
Summary:
While iterating forward, if the current key is a merge, the internal iterator is positioned at the next key. If Prev() is called at that point, an extra Prev() is needed to recover the location.
This is the second attempt at a fix, after reverting ec70fea4c4. This time the fix is narrowed to the case where the current key is a merge key, and it avoids the reseeking logic for max_iterating skipping.

Test Plan: enable the two disabled tests and make sure they pass

Reviewers: rven, IslamAbdelRahman, kradhakrishnan, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43557
2015-08-05 11:08:50 -07:00
Andres Notzli
c465071029 Removing duplicate code
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.

Test Plan: make clean all check

Reviewers: igor, sdong, rven, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43263
2015-08-05 07:33:27 -07:00
Mike Kolupaev
e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Ari Ekmekji
5dc3e6881a Update Tests To Enable Subcompactions
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_stats_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43443
2015-08-04 22:19:07 -07:00
Islam AbdelRahman
c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler, which allows enforcing a rate limit on file deletion.
Instead of deleting files immediately, files are moved to a trash directory and deleted in a background thread that applies a sleep penalty between deletes if needed.
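
One way to derive such a sleep penalty is to compare the time that should have elapsed at the configured rate with the time that actually has. The helpers below are a sketch under that assumption; `TotalPenaltyMicros` and `SleepMicros` are hypothetical names, and a rate of 0 is treated as "no limit" purely for this illustration:

```cpp
#include <cassert>
#include <cstdint>

// Microseconds that should have elapsed once `total_deleted_bytes` have been
// deleted at `rate_bytes_per_sec`. The multiplication can overflow for
// extremely large byte counts; a real implementation would guard against it.
uint64_t TotalPenaltyMicros(uint64_t total_deleted_bytes,
                            uint64_t rate_bytes_per_sec) {
  if (rate_bytes_per_sec == 0) {
    return 0;  // assumption of this sketch: 0 means deletion is not limited
  }
  return total_deleted_bytes * 1000000 / rate_bytes_per_sec;
}

// How long the background thread should sleep before the next delete, given
// how much time has already passed since it started deleting.
uint64_t SleepMicros(uint64_t total_penalty_micros, uint64_t elapsed_micros) {
  return total_penalty_micros > elapsed_micros
             ? total_penalty_micros - elapsed_micros
             : 0;
}
```

Working with computed penalties rather than wall-clock measurements also makes the behaviour easy to check in tests, since the expected values are deterministic.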

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang
241bb2aef3 Make DBCompactionTest.SkipStatsUpdateTest more stable.
Summary:
Make DBCompactionTest.SkipStatsUpdateTest more stable by
removing a flaky but unnecessary assertion on the size of the db,
as simply checking the random file open count suffices.

Test Plan: db_compaction_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43533
2015-08-04 15:47:05 -07:00
Yueh-Hsuan Chiang
14d0bfa429 Add DBOptions::skip_stats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decisions,
esp. when the number of deletion entries is high, but this function
can slow down DBOpen, esp. in disk environments.

This patch adds DBOptions::skip_stats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
Venkatesh Radhakrishnan
20b244fcca Fix CompactFiles by adding all necessary files
Summary:
The compact files API had a bug where some overlapping files
are not added. These are files which overlap with files which were
added to the compaction input files, but not to the original set of
input files. This happens only when there are more than two levels
involved in the compaction. An example will illustrate this better.

Level 2 has 1 input file 1.sst which spans [20,30].

Level 3 has added file  2.sst which spans [10,25]

Level 4 has file 3.sst which spans [35,40] and
        input file 4.sst which spans [46,50].

The existing code would not add 3.sst to the set of input_files because
it only becomes an overlapping file in level 4 and it wasn't one in
level 3.

When installing the results of the compaction, 3.sst would overlap with
output file from the compact files and result in the assertion in
version_set.cc:1130

  // Must not overlap
  assert(level <= 0 || level_files->empty() ||
         internal_comparator_->Compare(
             (*level_files)[level_files->size() - 1]->largest, f->smallest) <
             0);

This change now adds overlapping files from the current level to the set
of input files also so that we don't hit the assertion above.
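
The expansion can be modelled as a walk over the levels that keeps a running key range and pulls in every file that overlaps it. This is a simplified sketch with integer keys, not the actual CompactFiles code; `ToyFile` and `ExpandInputs` are hypothetical names, and `levels` is assumed to contain only the levels that take part in the compaction:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct ToyFile {
  std::string name;
  int smallest;
  int largest;
};

// levels: level number -> files in that level (visited in ascending order).
// inputs: names of the files the caller asked to compact.
// Returns the inputs plus every file that overlaps the accumulated key range.
std::set<std::string> ExpandInputs(
    const std::map<int, std::vector<ToyFile>>& levels,
    std::set<std::string> inputs) {
  bool have_range = false;
  int lo = 0;
  int hi = 0;
  auto extend = [&](const ToyFile& f) {
    lo = have_range ? std::min(lo, f.smallest) : f.smallest;
    hi = have_range ? std::max(hi, f.largest) : f.largest;
    have_range = true;
  };
  for (const auto& entry : levels) {
    const std::vector<ToyFile>& files = entry.second;
    // First account for the files of this level that are already inputs.
    for (const ToyFile& f : files) {
      if (inputs.count(f.name) > 0) extend(f);
    }
    // Then add overlapping files from the current level as well. Adding a
    // file can widen the range, so repeat until nothing changes.
    bool changed = have_range;
    while (changed) {
      changed = false;
      for (const ToyFile& f : files) {
        bool overlaps = f.smallest <= hi && f.largest >= lo;
        if (overlaps && inputs.insert(f.name).second) {
          extend(f);
          changed = true;
        }
      }
    }
  }
  return inputs;
}
```

For the example above, starting from the inputs 1.sst and 4.sst, the walk adds 2.sst in level 3 and, because 4.sst widens the range in level 4, also 3.sst.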

Test Plan:
d=/tmp/j; rm -rf $d; seq 1000 | parallel --gnu --eta
'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_compaction_test
--gtest_filter=*CompactilesOnLevel* --gtest_also_run_disabled_tests >&
'$d'/log-{}'

Reviewers: igor, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43437
2015-08-03 15:53:22 -07:00
Venkatesh Radhakrishnan
87df6295dd Make SuggestCompactRangeNoTwoLevel0Compactions deterministic
Summary:
Made SuggestCompactRangeNoTwoLevel0Compactions deterministic by forcing
a flush after generating a file and waiting for compaction at the end.

Test Plan: Run SuggestCompactRangeNoTwoLevel0Compactions

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43449
2015-08-03 15:52:52 -07:00
Ari Ekmekji
40c64434d4 Parallelize L0-L1 Compaction: Restructure Compaction Job
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.

This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.

The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are

  # Correct aggregation of compaction job statistics across multiple threads
  # Proper opening/closing of output files (make sure each thread's is unique)
  # Keys that span multiple L1 files
  # Skewed distributions of keys within L0 files

Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test

Reviewers: igor, noetzli, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42699
2015-08-03 11:32:14 -07:00
Andres Notzli
193dc977e7 Fixing dead code in table_properties_collector_test
Summary:
There was a bug in table_properties_collector_test that this patch
is fixing: `!backward_mode && !test_int_tbl_prop_collector` in
TestCustomizedTablePropertiesCollector was never true, so the code
in the if-block never got executed. The reason is that the
CustomizedTablePropertiesCollector test was skipping tests with
`!backward_mode_ && !encode_as_internal`. The reason for skipping
the tests is unknown.

Test Plan: make table_properties_collector_test && ./table_properties_collector_test

Reviewers: rven, igor, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43281
2015-07-30 16:59:03 -07:00
agiardullo
8161bdb5a0 WriteBatch Save Points
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex.  Support for partial transaction rollback is needed for MyRocks.

An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869.  However, the other implementation is messier because it is implemented outside of WriteBatch.  This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.
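
Implementing save points inside the batch can be as simple as remembering how long the batch was when each save point was set. The class below is a toy model of that idea, not the RocksDB WriteBatch: `ToyWriteBatch` is a hypothetical name, and the rollback reports failure with a bool instead of a Status.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class ToyWriteBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    ops_.emplace_back(key, value);
  }

  // Remember the current length of the batch.
  void SetSavePoint() { save_points_.push_back(ops_.size()); }

  // Drop every operation added since the most recent save point and remove
  // that save point. Returns false if no save point has been set.
  bool RollbackToSavePoint() {
    if (save_points_.empty()) {
      return false;
    }
    ops_.resize(save_points_.back());
    save_points_.pop_back();
    return true;
  }

  std::size_t Count() const { return ops_.size(); }

 private:
  std::vector<std::pair<std::string, std::string>> ops_;
  std::vector<std::size_t> save_points_;  // stack of batch lengths
};
```

Because the save points form a stack, nested partial rollbacks fall out naturally: each rollback undoes only the operations added since the innermost save point.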

Test Plan: Added unit tests

Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42723
2015-07-29 16:54:23 -07:00
Andres Notzli
d06c82e477 Further cleanup of CompactionJob and MergeHelper
Summary:
Simplified logic in CompactionJob and removed unused parameter in
MergeHelper.

Test Plan: make && make check

Reviewers: rven, igor, sdong, yhchiang

Reviewed By: sdong

Subscribers: aekmekji, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42687
2015-07-28 19:21:55 -07:00
Andres Notzli
e95c59cd2f Count number of corrupt keys during compaction
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.

Test Plan: make && make check

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42921
2015-07-28 16:41:40 -07:00
Poornima Chozhiyath Raman
1bdfcef7bf Fix when output level is 0 of universal compaction with trivial move
Summary: Fix for universal compaction with trivial move when the output level is 0. The tests were failing. Fixed by allowing normal compaction when the output level is 0.

Test Plan: modified test cases run successfully.

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: anthony, kradhakrishnan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42933
2015-07-27 14:25:57 -07:00
sdong
82f148ef97 Fix nondeterministic failure of test DBCompactionTest.PartialCompactionFailure
Summary: DBCompactionTest.PartialCompactionFailure has a risk that one flush job writes out two mem tables into one file, so that fewer files are flushed than expected. Fix it by waiting for the flush to finish after every write.

Test Plan: Run the test

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42831
2015-07-22 13:46:56 -07:00
Mike Kolupaev
4922af6f8d fixed DBTest.GetPropertiesOfAllTablesTest and DBTest.GetUserDefinedTablaProperties flakiness
Summary: These tests used to fail if a compaction happened between flushing tables and enumerating them to get properties.

Test Plan: this reports occasional failures without this diff and no failures with it: `for i in {1..10000}; do echo $i; done | parallel --gnu -j100 'TEST_TMPDIR=`TMPDIR=/dev/shm/rockstemp mktemp -d -t` ./db_test --gtest_filter=DBTest.GetUserDefinedTablaProperties >&/dev/null || echo {} failed'`

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42861
2015-07-22 12:37:49 -07:00
Mike Kolupaev
fe09a6dae3 [wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:

"
This sequence will cause data loss in the middle after a sync write:

non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash

After rebooting, users might see key 2 but not key 1, which violates the API of sync write.

This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.

One way to fix it: on a sync write, if there are outstanding unsynced log files, we need to sync them too.
"

This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
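
The failure sequence quoted above can be replayed with a toy model. This is a hedged sketch, not RocksDB's WAL code: `ToyLog`, `ToyWal`, `SwitchLog`, and `RunCrashScenario` are hypothetical names, and it assumes that the pending flush has already switched writes to a new log file and that a crash discards every log entry that was not synced.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: one log file with a count of durable (synced) entries.
struct ToyLog {
  std::vector<std::string> keys;
  std::size_t synced = 0;  // the first `synced` keys survive a crash
};

class ToyWal {
 public:
  explicit ToyWal(bool sync_previous_logs)
      : sync_previous_logs_(sync_previous_logs), logs_(1) {}

  // Append a key to the newest log; optionally sync.
  void Write(const std::string& key, bool sync) {
    logs_.back().keys.push_back(key);
    if (!sync) return;
    if (sync_previous_logs_) {
      // Behavior described in the summary: sync older logs as well.
      for (ToyLog& log : logs_) log.synced = log.keys.size();
    } else {
      // Old behavior: only the current log is synced.
      logs_.back().synced = logs_.back().keys.size();
    }
  }

  // Start a new log file, as assumed to happen when a flush is triggered.
  void SwitchLog() { logs_.emplace_back(); }

  // Keys that are still readable after a crash.
  std::vector<std::string> RecoverAfterCrash() const {
    std::vector<std::string> recovered;
    for (const ToyLog& log : logs_) {
      recovered.insert(recovered.end(), log.keys.begin(),
                       log.keys.begin() + log.synced);
    }
    return recovered;
  }

 private:
  bool sync_previous_logs_;
  std::vector<ToyLog> logs_;
};

// Replays: non-sync write key1, flush triggered, sync write key2, crash.
std::vector<std::string> RunCrashScenario(bool sync_previous_logs) {
  ToyWal wal(sync_previous_logs);
  wal.Write("key1", /*sync=*/false);
  wal.SwitchLog();
  wal.Write("key2", /*sync=*/true);
  return wal.RecoverAfterCrash();
}
```

In this model, `RunCrashScenario(false)` recovers only `key2`, which is the gap described in the task summary, while `RunCrashScenario(true)` recovers both keys.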

Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
Andres Notzli
06aebca592 Report live data size estimate
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.

Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test --gtest_filter=*GetProperty*

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41493
2015-07-21 21:33:20 -07:00
sdong
02b635fa38 Fix nondeterministic failure of DBTest.GetPropertiesOfAllTablesTest
Summary: DBTest.GetPropertiesOfAllTablesTest generates four files and expects to find four files, but an L0->L1 compaction can be triggered and compact them into a single file. Fix it by raising the level-0 file-count compaction trigger.

Test Plan: Run it many times and see it never fails.

Reviewers: kradhakrishnan, IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42789
2015-07-21 17:13:23 -07:00