Commit Graph

2057 Commits

Author SHA1 Message Date
Siying Dong
962aa64292 Merge pull request #695 from yuslepukhin/address_windows_build
Address windows build issues caused by introducing Subcompaction
2015-08-20 17:04:48 -07:00
Dmitri Smirnov
5bf8907622 More indent adjustment. 2015-08-20 14:14:02 -07:00
Dmitri Smirnov
e2a9f43d64 Adjust indent 2015-08-20 14:10:51 -07:00
Dmitri Smirnov
1cac89c9b1 Address windows build issues
Intro SubCompactionState move functionality
 =delete copy functionality
 #ifdef SyncPoint in tests for Windows Release builds
2015-08-20 14:08:24 -07:00
Islam AbdelRahman
027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Andres Noetzli
b604d2562f Removing unused variables to fix build
Summary: Removing two unused variables that prevented compilation.

Test Plan: make all

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44991
2015-08-19 16:57:40 -07:00
Venkatesh Radhakrishnan
1b114eed4d Free file iterators for files which are above the iterate upper bound to Improve memory utilization
Summary:
This diff improves the memory utilization for tailing iterators RocksDB,
by freeing file iterators which are over the upper bound.
It is an updating on Siying's original diff for improving the memory usage for
tailing iterators. The changes for the seek and next path are now complete
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: march, tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43833
2015-08-19 16:05:51 -07:00
Islam AbdelRahman
3fd70b05b8 Rate limit deletes issued by DestroyDB
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately

Test Plan: added a unittest

Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: jeanxu2012, dhruba

Differential Revision: https://reviews.facebook.net/D44955
2015-08-19 15:02:17 -07:00
Yueh-Hsuan Chiang
df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
sdong
888fbdc889 Remove the contstaint that iterator upper bound needs to be within a prefix
Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one.

Test Plan: make commit-prereq

Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven

Reviewed By: rven

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44949
2015-08-19 11:03:51 -07:00
Ari Ekmekji
137c376675 Removing variables used only in assertions to prevent build error
Summary:
A couple variables were declared but only used in assertions
which causes issues when building in fbcode.

Test Plan: make dbg  and   make release

Reviewers: yhchiang, sdong, igor, anthony, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44937
2015-08-19 08:52:22 -07:00
Ari Ekmekji
b47cc58516 Bounding Number of Subcompactions
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large
this causes issues, particularly with the overlap between files since
very small output files can be generated. This diff bounds the number
of subcompactions to the user option DBOption.num_subcompactions.

Test Plan: ./db_test ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44883
2015-08-18 14:56:31 -07:00
Venkatesh Radhakrishnan
e58e1b18e7 Make tailing iterator show new entries in memtable.
Summary:
Reseek mutable_iter if it is invalid in Next and immutable_iter
is invalid.

Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext

Reviewers: tnovak, march, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44865
2015-08-18 14:40:06 -07:00
Ari Ekmekji
601b1aaca0 Fixing Failed Assertion in Subcompaction State Diff
Summary:
In D43239 (https://reviews.facebook.net/D43239) there is an
assertion to make sure a subcompaction's output is never empty at the
end of execution. This assertion however breaks the build because some
tests lead to exactly that scenario. So instead I have altered the logic
to handle this case instead of just failing the assertion.

The reason that it is possible for a subcompaction's output to be empty is
that during a sequential execution of subcompactions, if a user aborts the
compaction job then some of the later subcompactions to be executed may
have yet to process any keys and therefore have yet to generate output files.
This becomes very rare once the subcompactions are executed in parallel,
but for now they are still sequential so the case is possible when there is an
early termination, as in some of the tests.

Test Plan: ./db_test  ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44877
2015-08-18 12:27:12 -07:00
Ari Ekmekji
f0da6977a3 [Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_test

Reviewers: sdong, anthony, igor, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43239
2015-08-18 11:06:23 -07:00
Andres Notzli
f32a572099 Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.

Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.

Test Plan: make clean all check

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43353
2015-08-17 17:34:38 -07:00
sdong
72613657f0 Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.

Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected

Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44193
2015-08-14 17:32:42 -07:00
Nathan Bronson
b7198c3afe reduce db mutex contention for write batch groups
Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.

For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment.  With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win.  Not all scenarios see such a
win, but none show a loss.  This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).

Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Reviewers: sdong, igor, rven, ljin, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43887
2015-08-14 10:55:43 -07:00
sdong
603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00
sdong
4637207120 Add test case to repro the mispositional iterator in a low-chance data race case
Summary: Iterator has a bug: if a child iterator reaches its end, and user issues a Prev(), and just before SeekToLast() of the child iterator is called, some extra rows is added in the end, the position of iterator can be misplaced.

Test Plan: Run the tests with or without valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43671
2015-08-12 10:50:52 -07:00
agiardullo
0db807ec28 Transaction error statuses
Summary:
Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures.

https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954

Test Plan: unit tests

Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43323
2015-08-11 17:52:56 -07:00
agiardullo
c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
Islam AbdelRahman
c2868cbc52 Use manual_compaction for compaction_job_test
Summary:
Under certain conditions (disable compression) the compactions that are created in compaction_job_test will pass the trivial_move conditions
This will cause problems since we assert that we dont run a compaction if it's a trivial move
https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L144-L147

for example when we disable compression, compactions become a valid trivial move and the assert fails
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/180/console

Test Plan: compaction_job_test

Reviewers: sdong, yhchiang, noetzli, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43983
2015-08-11 14:47:14 -07:00
Islam AbdelRahman
cee1e8a080 Parallelize LoadTableHandlers
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover

Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43755
2015-08-11 12:19:56 -07:00
Andres Notzli
4249f159d5 Removing duplicate code in db_bench/db_stress, fixing typos
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
copmression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.

Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress

Reviewers: yhchiang, sdong, rven, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43965
2015-08-11 11:46:15 -07:00
Nathan Bronson
1ae27113c7 reduce comparisons by skiplist
Summary:
Key comparison is the single largest CPU user for CPU-bound
workloads. This diff reduces the number of comparisons in two ways.

The first is that it moves predecessor array gathering from
FindGreaterOrEqual to FindLessThan, so that FindGreaterOrEqual can
return immediately if compare_ returns 0.  As part of this change I
moved the sequential insertion optimization into Insert, to remove the
undocumented (and smelly) requirement that prev must be equal to prev_
if it is non-null.

The second optimization is that all of the search functions skip calling
compare_ when moving to a lower level that has the same Next pointer.
With a branching factor of 4 we would expect this to happen 1/4 of
the time.

On a single-threaded CPU-bound workload (-benchmarks=fillrandom -threads=1
-batch_size=1 -memtablerep=skip_list -value_size=0 --num=1600000
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000)
on my dev server this is good for a 7% perf win.

Test Plan: unit tests

Reviewers: rven, ljin, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43233
2015-08-11 11:25:22 -07:00
Islam AbdelRahman
a9dcc0a638 Fix clang build
Summary:
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/893/console
Fixing clang build

Test Plan:
make clean
USE_CLANG=1 make all -j64

Reviewers: sdong, noetzli, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43959
2015-08-10 11:30:36 -07:00
Andres Notzli
68f934355a Better CompactionJob testing
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.

Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check

Reviewers: sdong, rven, igor, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42837
2015-08-07 21:59:51 -07:00
agiardullo
16ea1c7d1c simple ManagedSnapshot wrapper
Summary: Implemented this simple wrapper for something else I was working on.  Seemed like it makes sense to expose it instead of burying it in some random code.

Test Plan: added test

Reviewers: rven, kradhakrishnan, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43293
2015-08-06 17:59:05 -07:00
sdong
6a4aaadcd7 Avoid type unique_ptr in LogWriterNumber::writer for Windows build break
Summary:
Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explit free.
It is less safe but I can't think of a better way to unblock it.

Test Plan: valgrind check test

Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43647
2015-08-06 10:52:41 -07:00
Andres Noetzli
d7314ba759 Fixing endless loop if seeking to end of key with seq num 0
Summary:
When seeking to the last occurrence of a key with sequence number 0, db_iter
ends up in an endless loop because it seeks to type kValueTypeForSeek
which is larger than kTypeDeletion/kTypeValue. Added test case that triggers
the behavior.

Test Plan: make clean all check

Reviewers: igor, rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43653
2015-08-06 10:43:28 -07:00
Islam AbdelRahman
29b028b0ed Make DeleteScheduler tests more reliable
Summary: Update DeleteScheduler tests so that they verify the used penalties for waiting instead of measuring the time spent which is not reliable

Test Plan:
make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_TSAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_ASAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test

make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_ASAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43635
2015-08-05 19:16:52 -07:00
Poornima Chozhiyath Raman
7d364d0d94 Fix build failure
Summary: fix the build failure

Test Plan: make all

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43623
2015-08-05 16:38:12 -07:00
Poornima Chozhiyath Raman
960d936e83 Add function 'GetInfoLogList()'
Summary: The list of info log files of a db can be obtained using the new function.

Test Plan: New test in db_test.cc passed.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41715
2015-08-05 16:16:46 -07:00
sdong
7ccd1c80a7 Add two unit tests for SyncWAL()
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed.

Create a new test file db_wal_test and move two WAL related tests from db_test to here.

Test Plan: Run the new tests

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43605
2015-08-05 14:27:02 -07:00
sdong
3ae386eafe Add statistic histogram "rocksdb.sst.read.micros"
Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.

Test Plan:
Run db_bench and it shows the stats, like:

rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43275
2015-08-05 13:02:33 -07:00
Islam AbdelRahman
9aec75fbb9 Enable DBTest.FlushSchedule under TSAN
Summary: This patch will fix the false positive of DBTest.FlushSchedule under TSAN, we dont need to disable this test

Test Plan: COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.FlushSchedule"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43599
2015-08-05 11:47:07 -07:00
sdong
8e01bd1144 Fix misplaced position for reversing iterator direction while current key is a merge
Summary:
While doing forward iterating, if current key is merge, internal iterator position is placed to the next key. If Prev() is called now, needs to do extra Prev() to recover the location.
This is second attempt of fixing after reverting ec70fea4c4. This time shrink the fix to only merge key is the current key and avoid the reseeking logic for max_iterating skipping

Test Plan: enable the two disabled tests and make sure they pass

Reviewers: rven, IslamAbdelRahman, kradhakrishnan, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43557
2015-08-05 11:08:50 -07:00
Andres Notzli
c465071029 Removing duplicate code
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.

Test Plan: make clean all check

Reviewers: igor, sdong, rven, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43263
2015-08-05 07:33:27 -07:00
Mike Kolupaev
e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Ari Ekmekji
5dc3e6881a Update Tests To Enable Subcompactions
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.

Test Plan: ./db_test  ./db_compaction-test  ./compaction_job_stats_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43443
2015-08-04 22:19:07 -07:00
Islam AbdelRahman
c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang
241bb2aef3 Make DBCompactionTest.SkipStatsUpdateTest more stable.
Summary:
Make DBCompactionTest.SkipStatsUpdateTest more stable by
removing flaky but unnecessary assertion on the size of db
as simply checking the random file open count is suffice.

Test Plan: db_compaction_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43533
2015-08-04 15:47:05 -07:00
Yueh-Hsuan Chiang
14d0bfa429 Add DBOptions::skip_sats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.

This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
Venkatesh Radhakrishnan
20b244fcca Fix CompactFiles by adding all necessary files
Summary:
The compact files API had a bug where some overlapping files
are not added. These are files which overlap with files which were
added to the compaction input files, but not to the original set of
input files. This happens only when there are more than two levels
involved in the compaction. An example will illustrate this better.

Level 2 has 1 input file 1.sst which spans [20,30].

Level 3 has added file  2.sst which spans [10,25]

Level 4 has file 3.sst which spans [35,40] and
        input file 4.sst which spans [46,50].

The existing code would not add 3.sst to the set of input_files because
it only becomes an overlapping file in level 4 and it wasn't one in
level 3.

When installing the results of the compaction, 3.sst would overlap with
output file from the compact files and result in the assertion in
version_set.cc:1130

 // Must not overlap
   assert(level <= 0 || level_files->empty() ||
            internal_comparator_->Compare(
                (*level_files)[level_files->size() - 1]->largest, f->smallest) <
                0);
This change now adds overlapping files from the current level to the set
of input files also so that we don't hit the assertion above.

Test Plan:
d=/tmp/j; rm -rf $d; seq 1000 | parallel --gnu --eta
'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_compaction_test
--gtest_filter=*CompactilesOnLevel* --gtest_also_run_disabled_tests >&
'$d'/log-{}'

Reviewers: igor, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43437
2015-08-03 15:53:22 -07:00
Venkatesh Radhakrishnan
87df6295dd Make SuggestCompactRangeNoTwoLevel0Compactions deterministic
Summary:
Made SuggestCompactRangeNoTwoLevel0Compactions by forcing
a flush after generating a file and waiting for compaction at the end.

Test Plan: Run SuggestCompactRangeNoTwoLevel0Compactions

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43449
2015-08-03 15:52:52 -07:00
Ari Ekmekji
40c64434d4 Parallelize L0-L1 Compaction: Restructure Compaction Job
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.

This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.

The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are

  # Correct aggregation of compaction job statistics across multiple threads
  # Proper opening/closing of output files (make sure each thread's is unique)
  # Keys that span multiple L1 files
  # Skewed distributions of keys within L0 files

Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test

Reviewers: igor, noetzli, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42699
2015-08-03 11:32:14 -07:00
Andres Notzli
193dc977e7 Fixing dead code in table_properties_collector_test
Summary:
There was a bug in table_properties_collector_test that this patch
is fixing: `!backward_mode && !test_int_tbl_prop_collector` in
TestCustomizedTablePropertiesCollector was never true, so the code
in the if-block never got executed. The reason is that the
CustomizedTablePropertiesCollector test was skipping tests with
`!backward_mode_ && !encode_as_internal`. The reason for skipping
the tests is unknown.

Test Plan: make table_properties_collector_test && ./table_properties_collector_test

Reviewers: rven, igor, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43281
2015-07-30 16:59:03 -07:00
agiardullo
8161bdb5a0 WriteBatch Save Points
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex.  Support for partial transaction rollback is needed for MyRocks.

An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869.  However, the other implementation is messier because it is implemented outside of WriteBatch.  This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.

Test Plan: Added unit tests

Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42723
2015-07-29 16:54:23 -07:00
Andres Notzli
d06c82e477 Further cleanup of CompactionJob and MergeHelper
Summary:
Simplified logic in CompactionJob and removed unused parameter in
MergeHelper.

Test Plan: make && make check

Reviewers: rven, igor, sdong, yhchiang

Reviewed By: sdong

Subscribers: aekmekji, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42687
2015-07-28 19:21:55 -07:00
Andres Notzli
e95c59cd2f Count number of corrupt keys during compaction
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.

Test Plan: make && make check

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42921
2015-07-28 16:41:40 -07:00
Poornima Chozhiyath Raman
1bdfcef7bf Fix when output level is 0 of universal compaction with trivial move
Summary: Fix for universal compaction with trivial move, when the ouput level is 0. The tests where failing. Fixed by allowing normal compaction when output level is 0.

Test Plan: modified test cases run successfully.

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: anthony, kradhakrishnan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42933
2015-07-27 14:25:57 -07:00
sdong
82f148ef97 Fix test DBCompactionTest.PartialCompactionFailure undeterministic failure
Summary: DBCompactionTest.PartialCompactionFailure has a risk that one flush job writes out two mem tables into one file, so that the total files flushed are less than expected. Fix it by writing for flush to finish after every write.

Test Plan: Run the test

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42831
2015-07-22 13:46:56 -07:00
Mike Kolupaev
4922af6f8d fixed DBTest.GetPropertiesOfAllTablesTest and DBTest.GetUserDefinedTablaProperties flakiness
Summary: These tests used to fail if a compaction happened between flushing tables and enumerating them to get properties.

Test Plan: this reports occasional failures without this diff and no failures with it: `for i in {1..10000}; do echo $i; done | parallel --gnu -j100 'TEST_TMPDIR=`TMPDIR=/dev/shm/rockstemp mktemp -d -t` ./db_test --gtest_filter=DBTest.GetUserDefinedTablaProperties >&/dev/null || echo {} failed'`

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42861
2015-07-22 12:37:49 -07:00
Mike Kolupaev
fe09a6dae3 [wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:

"
This sequence will cause data loss in the middle after an sync write:

non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash

After rebooting, users might see key 2 but not key 1, which violates the API of sync write.

This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.

One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"

This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.

Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
Andres Notzli
06aebca592 Report live data size estimate
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.

Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41493
2015-07-21 21:33:20 -07:00
sdong
02b635fa38 Fix undeterministic failure of DBTest.GetPropertiesOfAllTablesTest
Summary: DBTest.GetPropertiesOfAllTablesTest generates four files and expects four files there, but a L0->L1 comapction can trigger to compact to one single file. Fix it by raising level 0 number of file compaction trigger

Test Plan: Run it many times and see it never fails.

Reviewers: kradhakrishnan, IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42789
2015-07-21 17:13:23 -07:00
Yueh-Hsuan Chiang
7219088cda Move general compaction tests from db_test.cc to db_compaction_test.cc
Summary: Move general compaction tests from db_test.cc to db_compaction_test.cc

Test Plan:
db_test
db_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42651
2015-07-21 03:05:57 -07:00
agiardullo
064294081b Improved FileExists API
Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.

Test Plan: unit tests

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42003
2015-07-20 17:20:40 -07:00
Yueh-Hsuan Chiang
443c6646bc Move remaining universal compaction tests from db_test.cc to db_universal_compaction_test.cc
Summary: Move remaining universal compaction tests from db_test.cc to db_universal_compaction_test.cc

Test Plan:
db_test
db_universal_compaction_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42645
2015-07-20 16:07:54 -07:00
Yueh-Hsuan Chiang
7462286d33 Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc
Summary: Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc

Test Plan:
db_test
db_inplace_update_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42657
2015-07-20 16:05:28 -07:00
Yueh-Hsuan Chiang
331954ab83 Fixed DBTestUniversalManualCompactionOutputPathId test
Summary:
Fixed DBTestUniversalManualCompactionOutputPathId test
by changing the expected number of files when setting up
the test as flushes no-longer preempt compactions
in patch https://reviews.facebook.net/D41931.

Also, include db_universal_copaction_test in make all check.

Test Plan: db_universal_copaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42639
2015-07-20 13:53:41 -07:00
krad
a75f23eb87 Relax assertions in unit DropWrites to be more permissible
Summary: This unit test is blocking our release since it fails under certain
compiler versions. The failure is due to a race in the unit test and not the
core functionality.

Test Plan: Run locally

Reviewers: sdong

CC: leveldb

Task ID: #7760955

Blame Rev:
2015-07-20 12:37:54 -07:00
sdong
ee80432ff8 db_bench add an option of --universal_allow_trivial_move
Summary: Now we allow trivial move in universal compaction. Add a parameter in db_bench

Test Plan: Run db_bench with this option on and off and make sure the option is switched correctly.

Reviewers: yhchiang, igor, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41427
2015-07-20 11:56:12 -07:00
sdong
58b4209e05 Trigger non-trivial compaction in fault_injection_test
Summary: Now the major test cases of fault_injection_test only insert keys in sorted order so compactions will be trivial move. Add a new mode to insert in non-sequential order to trigger non-trivial compactions.

Test Plan: Run the test

Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42435
2015-07-20 11:51:57 -07:00
Islam AbdelRahman
aa8ac6445b Skip unsupported tests in ROCKSDB_LITE
Summary:
Skipping these tests in ROCKSDB_LITE since they are not supported
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Test Plan:
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42573
2015-07-20 11:24:54 -07:00
Islam AbdelRahman
c06d1d8393 Make merge_test runnable in ROCKSDB_LITE
Summary: Make merge_test runnable in ROCKSDB_LITE

Test Plan: merge_test

Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42579
2015-07-20 11:17:52 -07:00
Islam AbdelRahman
0d1d9aeebe Block plain_table_db_test in ROCKSDB_LITE
Summary: Block plain_table_db_test in ROCKSDB_LITE since plain table is not supported in ROCKSDB_LITE

Test Plan: plain_table_db_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42159
2015-07-20 11:12:02 -07:00
Islam AbdelRahman
f0fe9126f2 Fix compile for write_callback_test in ROCKSDB_LITE
Summary: Add main for write_callback_test when compiled under ROCKSDB_LITE

Test Plan: write_callback_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42147
2015-07-20 10:54:15 -07:00
Islam AbdelRahman
cf6a7bebc8 Block cuckoo table tests in ROCKSDB_LITE
Summary: Cuckoo table is not supported in ROCKSDB_LITE, blocking it's tests

Test Plan:
cuckoo_table_builder_test
cuckoo_table_db_test
cuckoo_table_reader_test

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42141
2015-07-20 10:50:46 -07:00
Islam AbdelRahman
20922c4a5a Make compaction_picker_test runnable in ROCKSDB_LITE
Summary: Remove universal and fifo compaction tests from ROCKSDB_LITE since they are not supported

Test Plan: compaction_picker_test

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42129
2015-07-20 10:46:09 -07:00
sdong
6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
Igor Canadi
35ca59364c Don't let flushes preempt compactions
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.

We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.

Test Plan: make check (there might be some unit tests that depend on this behavior)

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41931
2015-07-17 12:02:52 -07:00
Ari Ekmekji
74c755c552 Added JSON manifest dump option to ldb command
Summary:
Added a new flag --json to the ldb manifest_dump command
that prints out the version edits as JSON objects for easier
reading and parsing of information.

Test Plan:
**Sample usage: **
```
./ldb manifest_dump --json --path=path/to/manifest/file
```

**Sample output:**
```
{"EditNumber": 0, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0}
{"EditNumber": 1, "LogNumber": 0, "ColumnFamily": 0}
{"EditNumber": 2, "LogNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 7, "LastSeq": 35356, "AddedFiles": [{"Level": 0, "FileNumber": 5, "FileSize": 1949284, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
{"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 36, "LastSeq": 290994, "DeletedFiles": [{"Level": 0, "FileNumber": 17}, {"Level": 0, "FileNumber": 20}, {"Level": 0, "FileNumber": 22}, {"Level": 0, "FileNumber": 24}, {"Level": 1, "FileNumber": 13}, {"Level": 1, "FileNumber": 14}, {"Level": 1, "FileNumber": 15}, {"Level": 1, "FileNumber": 18}], "AddedFiles": [{"Level": 1, "FileNumber": 25, "FileSize": 2114340, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 26, "FileSize": 2115213, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 27, "FileSize": 2114807, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 30, "FileSize": 2115271, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 31, "FileSize": 2115165, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 32, "FileSize": 2114683, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 35, "FileSize": 1757512, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
```

Reviewers: sdong, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41727
2015-07-17 10:07:40 -07:00
Igor Canadi
a96fcd09b7 Deprecate CompactionFilterV2
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.

Test Plan: make check

Reviewers: noetzli, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42405
2015-07-17 18:59:11 +02:00
Andres Notzli
1d20fa9d0f Fixed and simplified merge_helper
Summary:
MergeUntil was not reporting a success when merging an operand with
a Value/Deletion despite the comments in MergeHelper and CompactionJob
indicating otherwise. This lead to operands being written to the compaction
output unnecessarily:

M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M2 M3 M4 M5 (before the diff)
M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M4 M5 (after the diff)

In addition, the code handling Values/Deletion was basically identical.
This patch unifies the code. Finally, this patch also adds testing for
merge_helper.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, tnovak, igor

Reviewed By: igor

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42351
2015-07-17 09:27:24 -07:00
sdong
ac2b9367fe Fix a typo in variable
Summary: Typo seqeuntial -> sequential in db/fault_injection_test.cc

Test Plan: Build it

Reviewers: ott, igor, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42417
2015-07-16 12:17:24 -07:00
sdong
6c0c8dee7b Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened
Summary:
Previous run may leave some SST files with higher file numbers than manifest indicates.
Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may happen interleaving with compaction or flush, and overwrite files generated by compaction of flushes after they are generated. This might cause data loss. This possibility of interleaving is recently introduced.
Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes.

Test Plan: Add a unit test. This verification will have a chance to fail without the fix but doesn't fix without the fix.

Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42399
2015-07-16 10:57:41 -07:00
Andres Notzli
e4af3bfb27 Test for compaction of corrupted keys
Summary:
Fixes T7697334. Adds a simple test to check whether CompactionJob deals
with corrupted keys correctly. Right now, we preserve corrupted keys.
Note: depending on the type of corruption and options like comparators,
CompactionJob fails. This test just checks whether corrupted keys that
do not fail CompactionJob are preserved.

Test Plan:
`make compaction_job_test && ./compaction_job_test` -> Tests pass.
Add `input->Next(); continue;` in CompactionJob::ProcessKeyValueCompaction()
inside then-branch of `!ParseInternalKey(key, &ikey)` -> Tests fail.

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42237
2015-07-16 09:18:35 -07:00
agiardullo
81d072623c move convenience.h out of utilities
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.

Test Plan: unit tests.  Also compiled a link to the old location to verify the _Pragma works.

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42201
2015-07-15 14:51:51 -07:00
Poornima Chozhiyath Raman
beb19ad0dd Fixing delete files in Trivial move of universal compaction
Summary:
Trvial move in universal compaction was failing when trying to move files from levels other than 0.
This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels.
This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move.

Test Plan: ./db_test ran successfully with the new test cases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42135
2015-07-15 12:28:22 -07:00
krad
c613960690 Build fix.
Summary: gcc-4.9-glibc-2.20 complains about uninitialized variable.

db/compaction_picker.cc: In member function 'bool
rocksdb::CompactionPicker::IsInputNonOverlapping(rocksdb::Compaction*)':
db/compaction_picker.cc:1174:17: error:
'prev.rocksdb::{anonymous}::InputFileInfo::f' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   InputFileInfo prev, curr, next;

Test Plan: pmake on local environment

Reviewers: sdong igor

CC: leveldb@

Task ID: #

Blame Rev:
2015-07-15 11:25:10 -07:00
Aaron Feldman
2c8de0ecae Update --help message in db_bench.
Summary:
Remove --help entry for readhot.

Update read_random_exp_range flag description: The distribution is num *
exp(-r), not num * exp(r).

Test Plan: Run ./db_bench --help

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42303
2015-07-15 10:21:09 -07:00
Andres Notzli
6b2d44b2ff Refactoring of writing key/value pairs
Summary:
Before, writing key/value pairs out to files was done inside
ProcessKeyValueCompaction(). To make ProcessKeyValueCompaction()
more understandable, this patch moves the writing part to a separate
function. This is intended to be a stepping stone for additional
changes.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42243
2015-07-15 09:55:45 -07:00
Igor Canadi
9a6a0bd8c9 Style fix in compaction_job.cc
Summary: I didn't know this can work :)

Test Plan: none

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42081
2015-07-15 09:36:47 +02:00
Andres Notzli
ddad40e930 Fixed nullptr deref and added assert
Summary:
Fixes two minor issues in CompactionJob.
CompactionJob::Run() dereferences log_buffer_ without a check, so
this patch adds an assert in the constructor where log_buffer_
is assigned. compaction_job_stats_ can be null but
ProcessKeyValueCompaction was dereferencing it without a check.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42231
2015-07-14 23:12:34 -07:00
Yueh-Hsuan Chiang
801df912af Move UniversalCompaction related db-tests to db_universal_compaction_test.cc
Summary: Move UniversalCompaction related db-tests to db_universal_compaction_test.cc

Test Plan:
db_test
db_universal_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42225
2015-07-14 18:24:45 -07:00
Yueh-Hsuan Chiang
3ca6b2541e Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc
Summary: Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc

Test Plan:
db_test
db_test_tailing_iterator

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42021
2015-07-14 16:41:08 -07:00
Yueh-Hsuan Chiang
ce829c77e3 Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc
Summary: Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc

Test Plan:
db_test
db_log_iter_test

Reviewers: sdong, IslamAbdelRahman, igor, anthony

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42045
2015-07-14 16:08:21 -07:00
Yueh-Hsuan Chiang
c3f98bb89b Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc
Summary: Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc

Test Plan:
db_test
db_compaction_filter_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42207
2015-07-14 16:03:47 -07:00
agiardullo
18d5e1bf88 Remove db_impl_readonly dependency on utilities
Summary: Seems like the cleanest way to resolve this is to move CompactedDBImpl into db/.  CompactedDBImpl should probably live in the same place as DBImplReadonly since Opening the latter could end up instantiating the former.  Both DBImplReadonly and CompactedDBImpl inherit from DBImpl access protected members of DBImpl( and the latter access friendly private methods).

Test Plan: unit tests

Reviewers: sdong, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42027
2015-07-14 11:32:54 -07:00
Islam AbdelRahman
49640bd82f Allow write_batch_test to run with ROCKSDB_LITE
Summary: Remove ColumnFamiliesBatchWithIndexTest from ROCKSDB_LITE since WriteBatchWithIndex is not supported

Test Plan: write_batch_test

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42033
2015-07-14 10:35:35 -07:00
Igor Canadi
a9c5109515 Deprecate purge_redundant_kvs_while_flush
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.

Test Plan: none

Reviewers: sdong, yhchiang, anthony, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42063
2015-07-14 13:07:02 +02:00
Igor Canadi
5aea98ddd8 Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
Andres Notzli
ae29495e4b Avoid manipulating const char* arrays
Summary:
We were manipulating `const char*` arrays in CompactionJob to
change the sequence number/types of keys. This patch changes
UpdateInternalKey() to use string methods to do the manipulation
and updates all calls accordingly.

Test Plan:
Added test case for UpdateInternalKey() in dbformat_test.
make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41985
2015-07-14 00:21:41 -07:00
Andres Notzli
ab137af4ba Partial cleanup of CompactionJob
Summary:
Logging, dealing with key prefix batches and updating stats
moved from CompactionJob::Run() into separate functions.

Test Plan: make all && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41919
2015-07-14 00:09:20 -07:00
Yueh-Hsuan Chiang
b10cf4e2e0 Move DynamicLevel related db-tests to db_dynamic_level_test.cc
Summary: Move DynamicLevel related db-tests to db_dynamic_level_test.cc

Test Plan:
db_dynamic_level_test
db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42039
2015-07-13 19:00:30 -07:00
Islam AbdelRahman
e290f5d3ce Block reduce_levels_test in ROCKSDB_LITE
Summary: Block reduce_levels_test in ROCKSDB_LITE as LDBCommand is not supported

Test Plan:
make reduce_levels_test -j64
OPT=-DROCKSDB_LITE make reduce_levels_test -j64
make check -j64

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41967
2015-07-13 18:45:55 -07:00
Yueh-Hsuan Chiang
49f42ad032 Move global static functions in db_test_util to DBTestBase
Summary:
Move global static functions in db_test_util to DBTestBase.
This is to prevent unused function warning when decoupling
db_test.cc into multiple files.

Test Plan: db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42009
2015-07-13 17:41:41 -07:00
Yueh-Hsuan Chiang
625467a08a Move reusable part of db_test.cc to util/db_test_util.h
Summary:
Move reusable part of db_test.cc to util/db_test_util.h.
This makes it more possible to partition db_test.cc into
multiple smaller test files.

Also, fixed many old lint errors in db_test.

Test Plan: db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong, kradhakrishnan

Reviewed By: sdong, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41973
2015-07-13 16:53:38 -07:00
Ari Ekmekji
8bca83e5dd Add tombstone information in CompactionJobStats
Summary:
Added new statistics in CompactionJobStats to keep track of
deletion entries and the expiration of those entries. Updated these
fields in compaction_job.cc as compaction took place and wrote a new
test in compaction_job_stats_test.cc to verify accuracy.

Test Plan:
Wrote new test DeletionStatsTest in
compaction_job_stats_test.cc to verify

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41355
2015-07-13 15:51:38 -07:00
sdong
f9728640f3 "make format" against last 10 commits
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.

Test Plan: Build it.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41961
2015-07-13 13:50:18 -07:00
sdong
76d3cd3286 Fix public API dependency on internal codes and dependency on MAX_INT32
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.

Test Plan: Build it and try to build an external project on top of it.

Reviewers: anthony, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41745
2015-07-11 10:32:11 -07:00
sdong
5fd11853cb Print Fast CRC32 support information in DB LOG
Summary: Print whether fast CRC32 is supported in DB info LOG

Test Plan: Run db_bench and see it prints out correctly.

Reviewers: yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41733
2015-07-10 17:59:36 -07:00
Siying Dong
e41cbd9c2f Merge pull request #646 from yuslepukhin/ms_win_port
Windows Port from Microsoft
2015-07-10 15:53:39 -07:00
sdong
041b6f95a2 perf_context: report time spent on reading index and bloom blocks
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.

Test Plan: Will write a unit test

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41433
2015-07-10 14:45:42 -07:00
Dmitri Smirnov
c903ccc4c2 Merge from github/master 2015-07-09 18:01:08 -07:00
krad
7189e90c22 Fix a noisy unit test.
Summary: The t/DBTest.DropWrites test still fails under certain gcc version in release unit test.

I unfortunately cannot repro the failure (since the compilers have mapped library which I am not able to map to correctly). I am suspecting the clock skew.

Test Plan: Run make check

Reviewers:

CC: sdong igore

Task ID: #7312624

Blame Rev:
2015-07-09 13:43:52 -07:00
Aaron Feldman
1f4d565709 Add db_bench flag to set cache_index_and_filter_blocks
Summary:
The new flag --cache_index_and_filter_blocks sets
BlockBasedTableOptions.cache_index_and_filter_blocks

Test Plan: make db_bench. Working on benchmarks with the new flag.

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41481
2015-07-09 13:36:16 -07:00
Dmitri Smirnov
d8586ab22b All of these are in the new code added past 3.10
1) Crash in env_win.cc that prevented db_test run to completion and some new tests
     2) Fix new corruption tests in DBTest by allowing a shared trunction of files. Note that this is generally needed ONLY for tests.
     3) Close database so WAL is closed prior to inducing corruption similar to what we did within Corruption tests.
2015-07-08 16:56:45 -07:00
Poornima Chozhiyath Raman
4bed00a44b Fix function name format according to google style
Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file

Test Plan: Compilation success

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41265
2015-07-08 15:21:10 -07:00
krad
e2e3d84b2c Added multi WAL log testing to recovery tests.
Summary: Currently there is no test in the suite to test the case where
there are multiple WAL files and there is a corruption in one of them. We have
tests for single WAL file corruption scenarios. Added tests to mock
the scenarios for all combinations of recovery modes and corruption in
specified file locations.

Test Plan: Run make check

Reviewers: sdong igor

CC: leveldb@

Task ID: #7501229

Blame Rev:
2015-07-08 14:39:00 -07:00
Dmitri Smirnov
ef4b87f1b2 Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00
agiardullo
4f56632b16 Fix occasional failure in compaction_job_test
Summary: Coverage test has been occasionally failing due to this timing check.

Test Plan: run test

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41367
2015-07-07 16:10:23 -07:00
Poornima Chozhiyath Raman
411c8e3d19 Build fail fix
Summary: Build fail fix. Type cast issues.

Test Plan: compiled

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41349
2015-07-07 15:21:17 -07:00
Poornima Chozhiyath Raman
c0b23dd5b0 Enabling trivial move in universal compaction
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.

Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40875
2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang
d8e3e766f9 Fixed a bug in test ThreadStatusSingleCompaction
Summary:
Fixed a bug in test ThreadStatusSingleCompaction where
SyncPoint traces are not cleared before the test begins
its second iteration.

Test Plan: db_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41337
2015-07-07 14:00:03 -07:00
Yueh-Hsuan Chiang
4ce5be4255 fixed leaking log::Writers
Summary: Fixes valgrind errors in column_family_test.

Test Plan: `make check`, `make valgrind_check`

Reviewers: igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41181
2015-07-07 12:10:10 -07:00
Yueh-Hsuan Chiang
685582a0b4 Revert two diffs related to DBIter::FindPrevUserKey()
Summary:
This diff reverts the following two previous diffs related to
DBIter::FindPrevUserKey(), which makes db_stress unstable.
We should bake a better fix for this.

* "Fix a comparison in DBIter::FindPrevUserKey()"
  ec70fea4c4.

* "Fixed endless loop in DBIter::FindPrevUserKey()"
  acee2b08a2.

Test Plan: db_stress

Reviewers: anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41301
2015-07-07 11:36:24 -07:00
Andres Notzli
58d7ab3c68 Added tests for ExpandWhileOverlapping()
Summary:
This patch adds three test cases for ExpandWhileOverlapping()
to the compaction_picker_test test suite.
ExpandWhileOverlapping() only has an effect if the comparison
function for the internal keys allows for overlapping user
keys in different SST files on the same level. Thus, this
patch adds a comparator based on sequence numbers to
compaction_picker_test for the new test cases.

Test Plan:
- make compaction_picker_test && ./compaction_picker_test
  -> All tests pass
- Replace body of ExpandWhileOverlapping() with `return true`
  -> Compile and run ./compaction_picker_test as before
  -> New tests fail

Reviewers: sdong, yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41277
2015-07-06 22:25:27 -07:00
Igor Canadi
155ce60daf Fix compaction_job_test
Summary:
Two issues:
* the input keys to the compaction don't include sequence number.
* sequence number is set to max(seq_num), but it should be set to max(seq_num)+1, because the condition here is strictly-larger (i.e. we will only zero-out sequence number if the DB's sequence number is strictly greater than the key's sequence number): https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L830

Test Plan: make compaction_job_test && ./compaction_job_test

Reviewers: sdong, lovro

Reviewed By: lovro

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41247
2015-07-06 11:14:08 -07:00
Dmitri Smirnov
d2f0912bd3 Merge the latest changes from github/master 2015-07-02 17:23:41 -07:00
Mike Kolupaev
218487d8dc [wal changes 1/3] fixed unbounded wal growth in some workloads
Summary:
This fixes the following scenario we've hit:
 - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
 - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
 - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
 - a few more hours pass and we run out disk space because of one huge .log file.

Test Plan: `make check`; backported the new test, checked that it fails without this diff

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40893
2015-07-02 14:27:00 -07:00
Dmitri Smirnov
9dbde7277c Merge remote-tracking branch 'origin' into ms_win_port 2015-07-02 11:34:22 -07:00
Dmitri Smirnov
18285c1e2f Windows Port from Microsoft
Summary: Make RocksDb build and run on Windows to be functionally
 complete and performant. All existing test cases run with no
 regressions. Performance numbers are in the pull-request.

 Test plan: make all of the existing unit tests pass, obtain perf numbers.

 Co-authored-by: Praveen Rao praveensinghrao@outlook.com
 Co-authored-by: Sherlock Huang baihan.huang@gmail.com
 Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
 Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
2015-07-01 16:13:56 -07:00
sdong
05e2831966 Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.

Test Plan: valgrind check

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40929
2015-06-30 17:30:38 -07:00
krad
b0f1927dbb Increasing timeout for drop writes.
Summary: We have a race in the way test works. We avoided the race by adding the
wait to the counter. I thought 1s was eternity, but that is not true in some
scenarios. Increasing the timeout to 10s and adding warnings.

Also, adding nosleep to avoid the case where the wakeup thread is waiting behind
the sleeping thread for scheduling.

Test Plan: Run make check

Reviewers: siying igorcanadi

CC: leveldb@

Task ID: #7312624

Blame Rev:
2015-06-30 11:11:56 -07:00
Tomislav Novak
ec70fea4c4 Fix a comparison in DBIter::FindPrevUserKey()
Summary:
When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()`
advances the underlying iterator _past_ the current key (`saved_key_`); see
`MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_`
points to an entry with the same user key as `saved_key_`. As a result,
`it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the
_next_, instead of the previous, entry (new test, written by @lovro, reproduces
the bug).

This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than
`saved_key_`.

Test Plan: db_test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba, lovro

Differential Revision: https://reviews.facebook.net/D40791
2015-06-29 17:04:03 -07:00
Yueh-Hsuan Chiang
501591c423 Make column_family_test runnable in ROCKSDB_LITE
Summary: Make column_family_test runnable in ROCKSDB_LITE.

Test Plan: column_family_test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40251
2015-06-29 14:39:01 -07:00
krad
6199cba998 Fix race in unit test.
Summary: Avoid falling victim to race condition.

Test Plan: Run the unit test

Reviewers: sdong igor

CC: leveldb@

Task ID: #7312624

Blame Rev:
2015-06-29 11:40:21 -07:00
Igor Canadi
4cbc4e6f88 Call merge operators with empty values
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.

Test Plan: Added assert to Merge operator in merge_test

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40773
2015-06-26 11:35:46 -07:00
Venkatesh Radhakrishnan
c9cd404bcd Make flush check for shutdown
Summary:
Fixes task 7156865 where a compaction causes a hang in flush
memtable if CancelAllBackgroundWork was called prior to it.
Stack trace is in : https://phabricator.fb.com/P19848829
We end up waiting for a flush which will never happen because there are no background threads.

Test Plan: PreShutdownFlush

Reviewers: sdong, igor

Reviewed By: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40617
2015-06-25 14:43:25 -07:00
Poornima Chozhiyath Raman
4fb09c6871 Updating SeekToLast with upper bound
Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents

Test Plan: ./db_iter_test run successfully with the new testcase

Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40425
2015-06-25 09:44:30 -07:00
Yueh-Hsuan Chiang
dec2c9f564 Make table_properties_collector_test runnable in ROCKSDB_LITE
Summary: Make table_properties_collector_test runnable in ROCKSDB_LITE

Test Plan: table_properties_collector_test

Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40581
2015-06-24 01:38:53 -07:00
Islam AbdelRahman
674b1181cf Bottommost level compaction option
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40527
2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano
782a1590f9 Implement a table-level row cache
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.

Supports snapshots and merge operations.

Test Plan: Ran `make valgrind_check commit-prereq`

Reviewers: igor, philipp, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39849
2015-06-23 10:25:45 -07:00
krad
de85e4cadf Introduce WAL recovery consistency levels
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.

1. RecoverAfterRestart (kTolerateCorruptedTailRecords)

This mocks the current recovery mode.

2. RecoverAfterCleanShutdown (kAbsoluteConsistency)

This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.

3. RecoverPointInTime (kPointInTimeRecovery)

This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.

4. RecoverAfterDisaster (kSkipAnyCorruptRecord)

This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.

Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.

Reviewers: leveldb, sdong, igor

Subscribers: yoshinorim, dhruba

Differential Revision: https://reviews.facebook.net/D38487
2015-06-22 15:28:12 -07:00
Islam AbdelRahman
530534fceb Fix trivial move merge
Summary: Fixing bad merge

Test Plan: make -j64 check (this is not enough to verify the fix)

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40521
2015-06-22 15:20:30 -07:00
Venkatesh Radhakrishnan
04251e1e3a Add wal files to Checkpoint for multiple column families.
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.

Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.

Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40323
2015-06-19 16:08:31 -07:00
Igor Canadi
bf03f59c11 Disable CompressLevelCompaction() if Zlib is not supported
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.

Test Plan: `make check` without zlib

Reviewers: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40401
2015-06-18 18:46:26 -07:00
Igor Canadi
760e9a94de Fail DB::Open() when the requested compression is not available
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.

This patch fails DB::Open() if we don't support the compression that is specified in the options.

Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39687
2015-06-18 14:55:05 -07:00
Islam AbdelRahman
4eabbdb7ec Skip bottommost level compaction if possible
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level

Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set

Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that  should use force_bottommost_level_compaction to pass in a deterministic way

Test Plan:
make check
adding new tests

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40059
2015-06-18 11:03:31 -07:00
Igor Canadi
4b8bb62f0a Don't dump DBOptions for each column family
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.

Test Plan: Check out the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39729
2015-06-18 10:15:54 -07:00
Poornima Chozhiyath Raman
176f0bedce Merge branch 'master' of github.com:facebook/rocksdb
D40233: Replace %llu with format macros in ParsedInternalKey::DebugString())
2015-06-18 10:12:55 -07:00
Yueh-Hsuan Chiang
bb1c74ce18 Fixed a bug of CompactionStats in multi-level universal compaction case
Summary:
Universal compaction can involves in multiple levels.  However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.

This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1

Test Plan: Add a test in compaction_job_stats_test

Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40239
2015-06-17 23:40:34 -07:00
Poornima Chozhiyath Raman
a66b8157df Merge branch 'master' of github.com:facebook/rocksdb
D40233: Replace %llu with format macros in ParsedInternalKey::DebugString())
2015-06-17 20:44:52 -07:00
Poornima Chozhiyath Raman
f06be62fd2 Replace %llu with format macros in ParsedInternalKey::DebugString())
Test Plan: successfully compiled the code

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40233
2015-06-17 20:44:26 -07:00
Igor Canadi
2dc3910b5e Add --benchmark_write_rate_limit option to db_bench
Summary:
So far, we benchmarked RocksDB by writing as fast as possible. With this change, we're able to limit our write throughput, which should help us better understand how RocksDB performes under varying write workloads.

Specifically, I'm currently interested in the shape of the graph that has write throughput on one axis and write rate on another. This should help us with designing our stall system, as we have started to do with D36351.

Test Plan:
    $ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=1000000
    fillrandom   :     118.523 micros/op 8437 ops/sec;    0.9 MB/s
    $ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=2000000
    fillrandom   :      59.136 micros/op 16910 ops/sec;    1.9 MB/s

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39759
2015-06-17 16:44:52 -07:00
Islam AbdelRahman
12e030a992 Use CompactRangeOptions for CompactRange
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated

Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40209
2015-06-17 14:36:14 -07:00
Igor Canadi
4716ab4d16 Merge pull request #638 from HolodovAlexander/master
C api: human-readable statistics
2015-06-17 13:16:20 -07:00
Igor Canadi
25d600569d Clean up InstallSuperVersion
Summary:
We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper.

This diff removes the "optimization" and cleans up the code a bit.

Test Plan: make check

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40113
2015-06-17 12:37:59 -07:00
Yueh-Hsuan Chiang
1a08d0beb5 Block c_test in ROCKSDB_LITE
Summary: Block c_test in ROCKSDB_LITE as it's not supported in ROCKSDB_LITE.

Test Plan: c_test

Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40257
2015-06-17 10:54:51 -07:00
sdong
40f562e747 Allow GetApproximateSize() to include mem table size if it is skip list memtable
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.

Test Plan: Add a test case

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40119
2015-06-16 18:13:23 -07:00
Igor Canadi
d59d90bb1f db_bench periodically writes QPS to CSV file
Summary:
This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls)

Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available.

Test Plan:
Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000
and observed this in report.csv:

secs_elapsed,interval_qps
10,2725860
20,1980480
30,1863456
40,1454359
50,1460389

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40047
2015-06-12 14:31:53 -07:00
Yueh-Hsuan Chiang
5fec963877 Fixed false alarm of size comparison in compaction_job_stats_test
Summary: Fixed false alarm of size comparison in compaction_job_stats_test

Test Plan: compaction_job_stats_test

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39921
2015-06-12 10:44:07 -07:00
Islam AbdelRahman
cccd2199a6 Revert skip bottommost compaction
Summary:
Reverting this diff https://reviews.facebook.net/D39999
Will add an option to force bottom most level compaction and then re submit it

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40041
2015-06-12 10:43:33 -07:00
Islam AbdelRahman
20f2b54252 Skip bottom most level compaction if no compaction filter
Summary: If we don't have a compaction filter then we can skip compacting the bottom most level

Test Plan:
make check
added unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39999
2015-06-12 09:56:08 -07:00
sdong
7842920be5 Slow down writes by bytes written
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.

The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351
2015-06-11 20:42:18 -07:00
Igor Canadi
a84df655f3 Don't let two L0->L1 compactions run in parallel
Summary: With experimental feature SuggestCompactRange() we don't restrict running two L0->L1 compactions in parallel. This diff fixes this.

Test Plan: added a unit test to reproduce the failure. fixed the unit test

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39981
2015-06-11 15:42:16 -07:00
Islam AbdelRahman
d6ce0f7c61 Add largest sequence to FlushJobInfo
Summary:
Adding largest sequence number to FlushJobInfo
and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39927
2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang
3eddd1abe9 Add Env::GetThreadID(), which returns the ID of the current thread.
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.

In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.

Test Plan:
db_test
listener_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39735
2015-06-11 14:18:02 -07:00
Islam AbdelRahman
73faa3d41d Handling edge cases for ReFitLevel
Summary:
Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction
- all files where in L0 (now maximum level is L1)
- using kCompactionStyleUniversal (now maximum level in the last level)
- level_compaction_dynamic_level_bytes ??

We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel

Test Plan:
adding some tests
make -j64 check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: ott, dhruba

Differential Revision: https://reviews.facebook.net/D39663
2015-06-11 14:15:52 -07:00
Reed Allman
735df66552 C: add WriteBatch.PutLogData support 2015-06-10 00:12:33 -07:00
sdong
e409d3d745 Make "make all" work for CYGWIN
Summary: Some test and benchmark codes don't build for CYGWIN. Fix it.

Test Plan: Build "make all" with TARGET_OS=Cygwin on cygwin and make sure it passes.

Reviewers: rven, yhchiang, anthony, igor, kradhakrishnan

Reviewed By: igor, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39711
2015-06-09 16:36:07 -07:00
sdong
75d7075a8a Print info message about files need compaction for debuging purpose
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:

2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)

2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0

Test Plan:
Run test and see LOG files.

valgrind test DBTest.TablePropertiesNeedCompactTest

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: yoshinorim, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39771
2015-06-09 11:23:29 -07:00
Venkatesh Radhakrishnan
406a5682eb Fix hang when closing a DB after doing loads with WAL disabled.
Summary:
There is a hang during DB close in the following scenario:
a) a load with WAL disabled was done,
b) CancelAllBackgroundWork was called,
c) DB Close was called
This was because in that we will wait for a flush but we cannot do a
background flush because we have called CancelAllBackgroundWork which
marks the DB as shutting downn.

Test Plan: Added DBTest FlushOnDestroy

Reviewers: sdong

Reviewed By: sdong

Subscribers: yoshinorim, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39747
2015-06-09 10:39:49 -07:00
sdong
d8c8f08c12 GetSnapshot() and ReleaseSnapshot() to move new and free out of DB mutex
Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out.

Test Plan:
Go through all tests
make valgrind_check

Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor

Reviewed By: igor

Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39753
2015-06-08 21:57:02 -07:00
Islam AbdelRahman
643bbbf081 Use nullptr for default compaction_filter_factory
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI

Test Plan: make check

Reviewers: yoshinorim, ott, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39693
2015-06-08 16:34:26 -07:00
Igor Canadi
f02ce0c651 Fix ASAN errors in c_test
Summary: key_sizes claims that 3rd key is of length 8, but it's really only 3. This diff makes it length 8.

Test Plan: asan c_test works again.

Reviewers: sdong, yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39699
2015-06-08 11:28:40 -07:00
Igor Canadi
133130a4f5 Merge pull request #625 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-08 13:14:44 -04:00
Igor Canadi
de4d172d0f Merge pull request #622 from rdallman/c-multiget
C: add MultiGet support
2015-06-08 13:13:54 -04:00
sdong
6df589b446 Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.

Test Plan: Add a unit test.

Reviewers: igor, yhchiang, kradhakrishnan, rven

Reviewed By: rven

Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39585
2015-06-05 20:18:21 -07:00
Yueh-Hsuan Chiang
2e764f06ea [API Change] Improve EventListener::OnFlushCompleted interface
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters.  This minimizes the API change in the
future.

Test Plan:
listener_test
compact_files_test
example/compact_files_example

Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor

Reviewed By: rven, igor

Subscribers: IslamAbdelRahman, rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39543
2015-06-05 12:28:51 -07:00
Yueh-Hsuan Chiang
7322c74012 Revert incorrect commit
Summary: Revert incorrect commit

Test Plan: db_test

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39651
2015-06-05 11:23:09 -07:00
Islam AbdelRahman
31e60e2a77 Unlock mutex in ReFitLevel
Summary: I encountered an issue where the database hang, it looks like the mutex is not unlocked on return in ReFitLevel function

Test Plan: make -j64 check

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39609
2015-06-05 11:06:14 -07:00
Yueh-Hsuan Chiang
7647df8f9e Fixed the tsan failure in util/compaction_job_stats_impl.cc
Summary:
The type of smallest_output_key_prefix and largest_output_key_prefix
have been changed to std::string in https://reviews.facebook.net/D39537.
As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the
initialization.

Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone.

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39645
2015-06-05 11:05:35 -07:00
Igor Canadi
b2785472c8 Fix compile
Summary:
This commit broke the compile: 3ce3bb3da2
As evidenced here: https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1404_rocksdb_compile_ce2b1d11d42de93f7b375f7e6c41fb709f66e969_15_06_04_23_09_36

This should fix it

Test Plan: make check

Reviewers: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39627
2015-06-05 09:41:45 -04:00
Islam AbdelRahman
3ce3bb3da2 Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping

The conditions for trivial move have been updated

Introduced conditions:
  - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
  - Input level files cannot be overlapping

Removed conditions:
  - Trivial move only run when the compaction is not manual
  - Input level should can contain only 1 file

More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```

```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```

```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```

```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```

```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```

This diff is based on a work by @sdong https://reviews.facebook.net/D34149

Test Plan: make -j64 check

Reviewers: rven, sdong, igor

Reviewed By: igor

Subscribers: yhchiang, ott, march, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
Yueh-Hsuan Chiang
bb808eaddb Changed the CompactionJobStats::output_key_prefix type from char[] to string.
Summary:
Keys in RocksDB can be arbitrary byte strings.  However, in the current
CompactionJobStats, smallest_output_key_prefix and largest_output_key_prefix
are of type char[] without having a length, which is insufficient to handle
non-null terminated strings.

This patch change their type to std::string.

Test Plan: compaction_job_stats_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39537
2015-06-04 12:31:12 -07:00
Yueh-Hsuan Chiang
0b3172d071 Add EventListener::OnTableFileDeletion()
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.

Test Plan: Extend three existing tests in db_test to verify the deleted files.

Reviewers: rven, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38931
2015-06-03 19:57:01 -07:00
Reed Allman
211a195d41 C: add MultiGet support 2015-06-03 17:57:42 -07:00
Reed Allman
5dc174e11a C: add support for WriteBatch SliceParts params 2015-06-03 17:08:00 -07:00
Igor Canadi
2d0b9e5f0a Fix compile on darwin
Summary: We need to start doing some CI on Macs.

Test Plan: works now

Reviewers: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39489
2015-06-03 16:52:51 -04:00
sdong
3af668ed17 Fix DBTest.MigrateToDynamicLevelMaxBytesBase slowness with valgrind
Summary:
DBTest.MigrateToDynamicLevelMaxBytesBase with valgrind test is
extremely slow. Work it around by not having both threads running
everything non-stop.

Test Plan: Run the test with valgrind which used to take too long to finish and see it finish in reasonable time.

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39477
2015-06-03 12:08:37 -07:00
Igor Canadi
408cc4b8e0 Revert "Merge pull request #621 from rdallman/c-slice-parts-support"
This reverts commit 78382d4ba7, reversing
changes made to ca8b85ac04.
2015-06-03 13:34:07 -04:00
Igor Canadi
78382d4ba7 Merge pull request #621 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-03 13:16:14 -04:00
Yueh-Hsuan Chiang
0483dab2ab Remove a TODO that has been done
Summary: Remove a TODO that has been done

Test Plan: make

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39429
2015-06-02 18:38:57 -07:00
Yueh-Hsuan Chiang
8afafc2783 Fix compile warning in db/db_impl
Summary:
Fix the following compile warning in db/db_impl

  db/db_impl.cc:1603:19: error: implicit conversion loses integer precision: 'const uint64_t' (aka 'const unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
     info.job_id = job_id;
                 ~ ^~~~~~

Test Plan: db_test

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39423
2015-06-02 17:36:45 -07:00
Yueh-Hsuan Chiang
fe5c6321cb Allow EventListener::OnCompactionCompleted to return CompactionJobStats.
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.

Example CompactionJobStats returned by OnCompactionCompleted():
    smallest_output_key_prefix 05000000
    largest_output_key_prefix 06990000
    elapsed_time 42419
    num_input_records 300
    num_input_files 3
    num_input_files_at_output_level 2
    num_output_records 200
    num_output_files 1
    actual_bytes_input 167200
    actual_bytes_output 110688
    total_input_raw_key_bytes 5400
    total_input_raw_value_bytes 300000
    num_records_replaced 100
    is_manual_compaction 1

Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.

Reviewers: rven, igor, anthony, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38463
2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang
3083ed2129 Fixed heap-use-after-free error in compaction_job_test.cc
Summary: Fixed heap-use-after-free error in compaction_job_test.cc

Test Plan: compaction_job_test

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39411
2015-06-02 16:23:01 -07:00
Yueh-Hsuan Chiang
ab946af08a Fix a compile warning in listener_test.cc
Summary:
Fixed the following compile warning in listener_test.cc:
db/listener_test.cc:214:8: error: 'OnTableFileCreated' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
14:16:46   void OnTableFileCreated(

Test Plan:
make listener_test

Reviewers: sdong, igor

Subscribers: leveldb
2015-06-02 14:20:27 -07:00
Yueh-Hsuan Chiang
fc83821270 Add EventListener::OnTableFileCreated()
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created.  This patch is part of the
EventLogger and EventListener integration.

Test Plan: Augment existing test in db/listener_test.cc

Reviewers: anthony, kradhakrishnan, rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38865
2015-06-02 14:12:23 -07:00
Yueh-Hsuan Chiang
898e803fc5 Add a stats counter for DB_WRITE back which was mistakenly removed.
Summary: Add a stats counter for DB_WRITE back which was mistakenly removed.

Test Plan: augment GroupCommitTest

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39399
2015-06-02 12:35:12 -07:00
Mike Kolupaev
ec7a944360 more times in perf_context and iostats_context
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
 - perf_context.db_mutex_lock_nanos
 - perf_context.db_condition_wait_nanos
 - iostats_context.open_time
 - iostats_context.allocate_time
 - iostats_context.write_time
 - iostats_context.range_sync_time
 - iostats_context.logger_time

In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.

Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: march, dhruba, tnovak

Differential Revision: https://reviews.facebook.net/D39177
2015-06-02 02:07:58 -07:00
sdong
4266d4fd90 Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange()
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.

Test Plan: Add a unit test for it.

Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39099
2015-06-01 18:21:14 -07:00
Yueh-Hsuan Chiang
d333820bad Removed DBImpl::notifying_events_
Summary:
DBImpl::notifying_events_ is a internal counter in DBImpl which is
used to prevent DB close when DB is notifying events.  However, as
the current events all rely on either compaction or flush which
already have similar counters to prevent DB close, it is safe to
remove notifying_events_.

Test Plan:
listener_test
examples/compact_files_example

Reviewers: igor, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39315
2015-06-01 15:32:23 -07:00
Igor Canadi
a187e66ad0 Merge pull request #617 from rdallman/wb-merge-sliceparts
WriteBatch.Merge w/ SliceParts support
2015-05-31 13:34:34 -04:00
Igor Canadi
4c181f08bc Fix compile on darwin
Summary: As title

Test Plan: make check

Reviewers: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39243
2015-05-30 12:25:45 -04:00
agiardullo
bc7a7a400c fix LITE build
Summary: Broken by optimistic transaction diff.  (I only built 'release' not 'static_lib' when testing).

Test Plan: build

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39219
2015-05-29 15:22:00 -07:00
agiardullo
dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
Reed Allman
21cd6b7ad8 C: add support for WriteBatch SliceParts params 2015-05-29 10:23:43 -07:00
Reed Allman
a0635ba3f6 WriteBatch.Merge w/ SliceParts support
also hooked up WriteBatchInternal
2015-05-29 04:30:03 -07:00
agiardullo
c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang
ec4ff4e99c Rename EventLoggerHelpers EventHelpers
Summary:
Rename EventLoggerHelpers EventHelpers, as it's going to include
all event-related helper functions instead of EventLogger only stuffs.

Test Plan: make

Reviewers: sdong, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39093
2015-05-28 13:37:47 -07:00
Yueh-Hsuan Chiang
672dda9b3b [API Change] Move listeners from ColumnFamilyOptions to DBOptions
Summary: Move listeners from ColumnFamilyOptions to DBOptions

Test Plan:
listener_test
compact_files_test

Reviewers: rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39087
2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang
3ab8ffd4dd Compaction now conditionally boosts the size of deletion entries.
Summary:
Compaction now boosts the size of deletion entries of a file only when
the number of deletion entries is greater than the number of non-deletion
entries in the file.  The motivation here is that in a stable workload,
the number of deletion entries should be roughly equal to the number of
non-deletion entries.  If we compensate the size of deletion entries in a
stable workload, the deletion compensation logic might introduce unwanted
effet which changes the shape of LSM tree.

Test Plan: db_test --gtest_filter="*Deletion*"

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38703
2015-05-26 14:05:38 -07:00
Igor Canadi
a81ac24127 Merge pull request #615 from rdallman/master
C: add more block based table stuff, some aux slice transform/merge ops
2015-05-26 14:19:31 -04:00
Yueh-Hsuan Chiang
6d299b70b8 Fixed a bug in EventLoggerHelpers::LogTableFileCreation
Summary:
Fixed a missing "}" at the end of the generated JSON Log
in EventLoggerHelpers::LogTableFileCreation.

Test Plan: db_bench

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38919
2015-05-26 10:55:46 -07:00
Yueh-Hsuan Chiang
a0580205c8 Removed an unused private variable in db_impl.h
Summary: Removed an unused private variable in db_impl.h

Test Plan: make db_test

Reviewers: sdong, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38925
2015-05-26 10:46:26 -07:00
Reed Allman
9c38ce1d02 C: extra bbto / noop slice transform 2015-05-22 22:56:28 -07:00
Igor Canadi
ea6d3a8ac0 Don't skip last level when calculating compaction stats
Summary: We have a bug where we don't report the last level's files as being compacted. This fixes it.

Test Plan: See the fix in action here: https://phabricator.fb.com/P19845738

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38727
2015-05-22 15:30:43 -04:00
Yueh-Hsuan Chiang
5c224d1b70 Fixed two bugs on logging file deletion.
Summary:
This patch fixes the following two bugs on logging file deletion.

1.  Previously, file deletion failure was only logged in INFO_LEVEL.
    This patch changes it to ERROR_LEVEL and does some code clean.

2.  EventLogger previously will always generate the same log on
    table file deletion even when file deletion is not successful.
    Now the resulting status of file deletion will also be logged.

Test Plan: make all check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38817
2015-05-22 12:10:51 -07:00
Yueh-Hsuan Chiang
dc81efe415 Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL
Summary: Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL

Test Plan:
Use db_bench to verify the log level.

Sample output:
    2015/05/22-00:20:39.778064 7fff75b41300 [WARN] RocksDB version: 3.11.0
    2015/05/22-00:20:39.778095 7fff75b41300 [WARN] Git sha rocksdb_build_git_sha:7fee8775a459134c4cb04baae5bd1687e268f2a0
    2015/05/22-00:20:39.778099 7fff75b41300 [WARN] Compile date May 22 2015
    2015/05/22-00:20:39.778101 7fff75b41300 [WARN] DB SUMMARY
    2015/05/22-00:20:39.778145 7fff75b41300 [WARN] SST files in /tmp/rocksdbtest-691931916/dbbench dir, Total Num: 0, files:
    2015/05/22-00:20:39.778148 7fff75b41300 [WARN] Write Ahead Log file in /tmp/rocksdbtest-691931916/dbbench:
    2015/05/22-00:20:39.778150 7fff75b41300 [WARN]          Options.error_if_exists: 0
    2015/05/22-00:20:39.778152 7fff75b41300 [WARN]        Options.create_if_missing: 1
    2015/05/22-00:20:39.778153 7fff75b41300 [WARN]          Options.paranoid_checks: 1

Reviewers: MarkCallaghan, igor, kradhakrishnan

Reviewed By: igor

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38835
2015-05-22 11:54:59 -07:00
Yueh-Hsuan Chiang
687214f878 Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.
Summary: Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.

Test Plan: Extend SanitizeOptions test in column_family_test

Reviewers: sdong, rven, anthony, krishnanm86, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38829
2015-05-22 11:35:40 -07:00
Yueh-Hsuan Chiang
2abb592688 Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery().
Summary: Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery().

Test Plan: make all check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38823
2015-05-22 11:24:12 -07:00
Yueh-Hsuan Chiang
7fee8775a4 Allow EventLogger to directly log from a JSONWriter.
Summary:
Allow EventLogger to directly log from a JSONWriter.  This allows
the JSONWriter to be shared by EventLogger and potentially EventListener,
which is an important step to integrate EventLogger and EventListener.

This patch also rewrites EventLoggerHelpers::LogTableFileCreation(),
which uses the new API to generate identical log.

Test Plan:
Run db_bench in debug mode and make sure the log is correct and no
assertions fail.

Reviewers: sdong, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38709
2015-05-21 15:39:30 -07:00
Igor Canadi
7a3577519f Don't artificially inflate L0 score
Summary:
This turns out to be pretty bad because if we prioritize L0->L1 then L1 can grow artificially large, which makes L0->L1 more and more expensive. For example:
256MB @ L0 + 256MB @ L1 --> 512MB @ L1
256MB @ L0 + 512MB @ L1 --> 768MB @ L1
256MB @ L0 + 768MB @ L1 --> 1GB @ L1

....

256MB @ L0 + 10GB @ L1 --> 10.2GB @ L1

At some point we need to start compacting L1->L2 to speed up L0->L1.

Test Plan:
The performance improvement is massive for heavy write workload. This is the benchmark I ran: https://phabricator.fb.com/P19842671. Before this change, the benchmark took 47 minutes to complete. After, the benchmark finished in 2minutes. You can see full results here: https://phabricator.fb.com/P19842674

Also, we ran this diff on MongoDB on RocksDB on one replicaset. Before the change, our initial sync was so slow that it couldn't keep up with primary writes. After the change, the import finished without any issues

Reviewers: dynamike, MarkCallaghan, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38637
2015-05-21 11:40:48 -07:00
Yueh-Hsuan Chiang
e2c1d4b57f [Public API Change] Make DB::GetDbIdentity() be const function.
Summary: Make DB::GetDbIdentity() be const function.

Test Plan: make db_test

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38745
2015-05-21 11:01:48 -07:00
Yueh-Hsuan Chiang
812c461c96 Dump db stats in WARN level
Summary: Dump db stats in WARN level

Test Plan: run db_bench and verify the LOG

Reviewers: igor, MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38691
2015-05-19 18:42:17 -07:00
Mark Callaghan
944043d683 Add --wal_bytes_per_sync for db_bench and more IO stats
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D38667
2015-05-19 16:19:30 -07:00
Igor Canadi
04feaeebb9 Fix comparison between signed and usigned integers
Summary: Not sure why this fails on some compilers and doesn't on others.

Test Plan: none

Reviewers: meyering, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38673
2015-05-19 10:59:30 -07:00
Igor Canadi
4a855c0799 Add an option wal_bytes_per_sync to control sync_file_range for WAL files
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html

Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.

Test Plan: no more sync_file_range for WAL as evidenced by strace

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38253
2015-05-18 17:03:59 -07:00
Igor Canadi
b0fdda4ff0 Allow flushes to run in parallel with manual compaction
Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel

Test Plan: make check works

Reviewers: rven, yhchiang, sdong

Reviewed By: yhchiang, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38355
2015-05-18 15:34:33 -07:00
sdong
fb5bdbf987 DBTest.DynamicLevelMaxBytesCompactRange: make sure L0 is not empty before running compact range
Summary: DBTest.DynamicLevelMaxBytesCompactRange needs to make sure L0 is not empty to properly cover the code paths we want to cover. However, current codes have a bug that might leave the condition not held. Improve the test to ensure it.

Test Plan: Run the test in an environment that is used to fail. Also run it many times.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38631
2015-05-18 11:49:45 -07:00
sdong
6fa7085121 CompactRange skips levels 1 to base_level -1 for dynamic level base size
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.

Test Plan: Run all unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37125
2015-05-18 10:54:11 -07:00
Holodov Alexander
eeb44366ba C api: human-readable statistics 2015-05-16 12:34:28 +04:00
Yueh-Hsuan Chiang
3f0867c0fe Allow GetThreadList to report Flush properties.
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables

Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000

Sample output from db_bench which tracks same flush job

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush       5789 us                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush     30.634 ms                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38505
2015-05-15 23:22:22 -07:00
Igor Canadi
7413306d94 Take a chance on a random file when choosing compaction
Summary:
When trying to compact entire database with SuggestCompactRange(), we'll first try the left-most files. This is pretty bad, because:
1) the left part of LSM tree will be overly compacted, but right part will not be touched
2) First compaction will pick up the left-most file. Second compaction will try to pick up next left-most, but this will not be possible, because there's a big chance that second's file range on N+1 level is already being compacted.

I observe both of those problems when running Mongo+RocksDB and trying to compact the DB to clean up tombstones. I'm unable to clean them up :(

This diff adds a bit of randomness into choosing a file. First, it chooses a file at random and tries to compact that one. This should solve both problems specified here.

Test Plan: make check

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38379
2015-05-15 14:14:40 -07:00
sdong
5aad881298 DBTest.DynamicLevelMaxBytesBase2: remove an unnecesary check
Summary: DBTest.DynamicLevelMaxBytesBase2 has a check that is not necessary and may fail. Remove it, and add two unrelated check.

Test Plan: Run the test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38457
2015-05-14 09:22:43 -07:00
sdong
ec43a8b9fb Universal Compaction with multiple levels won't allocate up to output size
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0

Test Plan: Run all tests.

Reviewers: igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38439
2015-05-13 14:15:46 -07:00
sdong
bc68bd5a13 db_bench to support rate limiter
Summary: Add --rate_limiter_bytes_per_sec to db_bench to allow rater limit to disk

Test Plan:
Run
./db_bench --benchmarks=fillseq --num=30000000 --rate_limiter_bytes_per_sec=3000000 --num_multi_db=8 -disable_wal
And see io_stats to have the rate limited.

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38385
2015-05-13 10:03:41 -07:00
Yueh-Hsuan Chiang
df1f87a882 Fixed compile error in db/column_family.cc
Summary:
Fixed the following compile error in db/column_family.cc
    db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope
    16:14:45    ASSERT_GT(listeners.size(), 0U);

Test Plan: make db_test

Reviewers: igor, sdong, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38367
2015-05-12 16:20:03 -07:00
Yueh-Hsuan Chiang
14431e971d Fixed a bug in EventListener::OnCompactionCompleted().
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.

Test Plan: Extend existing test in listener_test.cc

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38349
2015-05-12 16:10:23 -07:00
Igor Canadi
dbd95b7532 Add more table properties to EventLogger
Summary:
Example output:

    {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}

Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.

Test Plan: make check + check out the output in the log

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38343
2015-05-12 15:53:55 -07:00
Igor Canadi
b5881762bc Reset parent_index and base_index when picking files marked for compaction
Summary: This caused a crash of our MongoDB + RocksDB instance. PickCompactionBySize() sets its own parent_index. We never reset this parent_index when picking PickFilesMarkedForCompactionExperimental(). So we might end up doing SetupOtherInputs() with parent_index that was set by PickCompactionBySize, although we're using compaction calculated using PickFilesMarkedForCompactionExperimental.

Test Plan: Added a unit test that fails with assertion on master.

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38337
2015-05-12 11:16:25 -07:00
agiardullo
711465ccec API to fetch from both a WriteBatchWithIndex and the db
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch.  (This is needed for transactions).

I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).

Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc.  WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).

Test Plan: Added new unit tests.

Reviewers: rven, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38037
2015-05-11 14:51:51 -07:00
Igor Canadi
3996fff8a1 Fix clang build - add override
Summary: In new clang we need to add override to every overriden function

Test Plan: none

Reviewers: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D38259
2015-05-09 11:04:14 -07:00
Igor Canadi
d978139063 SuggestCompactRange() is manual compaction
Summary: When reporting compaction that was started because of SuggestCompactRange() we should treat it as manual compaction.

Test Plan: none

Reviewers: yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38139
2015-05-08 19:37:02 -07:00
Yueh-Hsuan Chiang
77a5a543a5 Allow GetThreadList() to report basic compaction operation properties.
Summary:
Now we're able to show more details about a compaction in
GetThreadList() :)

This patch allows GetThreadList() to report basic compaction
operation properties.  Basic compaction properties include:
    1. job id
    2. compaction input / output level
    3. compaction property flags (is_manual, is_deletion, .. etc)
    4. total input bytes
    5. the number of bytes has been read currently.
    6. the number of bytes has been written currently.

Flush operation properties will be done in a seperate diff.

Test Plan:
/db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=1

Sample output of tracking same job:

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction     31.357 ms     CompactionJob::FinishCompactionOutputFile              BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction     59.440 ms     CompactionJob::FinishCompactionOutputFile              BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction    226.375 ms                        CompactionJob::Install              BaseInputLevel 1 | BytesRead 3958013 | BytesWritten 3621940 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37653
2015-05-06 22:51:06 -07:00
Igor Canadi
65fe1cfbb3 Cleanup CompactionJob
Summary:
Couple changes:
1. instead of SnapshotList, just take a vector of snapshots
2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported.
3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results.

Test Plan: make check

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36627
2015-05-05 19:01:12 -07:00
Laurent Demailly
df4130ad85 fix crashes in stats and compaction filter for db_ttl_impl
Summary: fix crashes in stats and compaction filter for db_ttl_impl

Test Plan:
Ran build with lots of debugging
https://reviews.facebook.net/differential/diff/194175/

Reviewers: yhchiang, igor, rven

Reviewed By: igor

Subscribers: rven, dhruba

Differential Revision: https://reviews.facebook.net/D38001
2015-05-05 16:54:47 -07:00
Venkatesh Radhakrishnan
7ea769487f Fix flakiness in column_family_test
Summary:
Fixes #6840824, running "make check" on centos6 hits
a deadlock in column_family_test

Test Plan:
seq 10000 | parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf
$t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test > $t/log-{}'
Made the test deterministic by narrrowing the window for the flush.

Reviewers: igor, meyering

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38079
2015-05-05 15:59:02 -07:00
Liangjun Feng
9aa011fa36 Optimize GetRange Function
Summary: Optimize GetRange Function by checking the level of the files

Test Plan: pass make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37977
2015-05-05 09:57:47 -07:00
Igor Canadi
36a7408896 Fix UNLIKELY parenthesis
Summary: Ooops :) status.ok() is acutally highly likely :)

Test Plan: none

Reviewers: rven, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38043
2015-05-05 08:57:34 -07:00
Jim Meyering
2ab7065af1 build: avoid unused-variable warning
Summary:
[noticed a new warning when building with the very latest gcc]
* db/memtablerep_bench.cc (FLAGS_env): Remove declaration
of unused varaible, to avoid this warning/error:

db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not\
  used [-Werror=unused-variable]
 static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
                      ^

Test Plan: compile

Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37983
2015-05-02 13:19:10 -07:00
Venkatesh Radhakrishnan
d2346c2cf0 Fix hang with large write batches and column families.
Summary:
This diff fixes a hang reported by a Github user.
https://www.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fissues%2F595%23issuecomment-96983273&h=9AQFYOWlo
Multiple large write batches with column families cause a hang.
The issue was caused by not doing flushes/compaction when the
write controller was stopped.

Test Plan: Create a DBTest from the user's test case

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37929
2015-05-01 15:41:50 -07:00
Mark Callaghan
b6b100fe04 Remove iter_refresh_interval_us
Summary:
The default, use one iter for the whole test, isn't good. This cost me
a few hours of debugging and a few days of tessting. For readonly
that isn't realistic and for read-write that keeps a lot of old sst files around.
I remove the option because nothing uses it and not calling gettimeofday per
loop iteration adds about 3% to QPS at 20 threads.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37965
2015-05-01 14:17:45 -07:00
Igor Canadi
dddceefe5e Fix clang build
Summary: fix build

Test Plan: works

Reviewers: kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37911
2015-04-30 11:11:35 -07:00
krad
d4540654e9 Optimize GetApproximateSizes() to use lesser CPU cycles.
Summary:
CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result.

We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result.

Test Plan: We have good test coverage. Run the tests.

Reviewers: sdong, igor, rven, dynamike

Subscribers: dynamike, maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37755
2015-04-30 10:55:03 -07:00
Igor Canadi
7246ad34d0 Don't compact bottommost level in SuggestCompactRange
Summary: Before the fix we also marked the bottommost level for compaction. This is wrong because then RocksDB has N+1 levels instead of N as before the compaction.

Test Plan: SuggestCompactRangeTest in db_test

Reviewers: yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37869
2015-04-29 13:35:48 -07:00