Commit Graph

174 Commits

Author SHA1 Message Date
Tyler Harter
4504c99030 Internal/user key bug fix.
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.

Test Plan: ./filter_block_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12501
2013-08-23 14:49:57 -07:00
Dhruba Borthakur
1186192ed1 Replace include/leveldb with include/rocksdb.
Summary: Replace include/leveldb with include/rocksdb.

Test Plan:
make clean; make check
make clean; make release

Differential Revision: https://reviews.facebook.net/D12489
2013-08-23 10:51:00 -07:00
Jim Paton
74781a0c49 Add three new MemTableRep's
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.

UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.

VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.

PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.

I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).

Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12117
2013-08-22 23:10:02 -07:00
Tyler Harter
94cf218720 Revert "Prefix scan: db_bench and bug fixes"
This reverts commit c2bd8f4824.
2013-08-22 18:01:11 -07:00
Tyler Harter
c2bd8f4824 Prefix scan: db_bench and bug fixes
Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target.  Still need to add statistics (perhaps in a separate diff).

Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D12273
2013-08-22 16:06:50 -07:00
Simha Venkataramaiah
60bf2b7d4a Add APIs to query SST file metadata and to delete specific SST files
Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures.

Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api.

I have used std::unique_ptr - will switch to boost:: since this is external. thoughts?

Unit test is fragile right now as it expects the compaction at certain levels.

Test Plan: unittest

Reviewers: dhruba, vamsi, emayanke

CC: zshao, leveldb, haobo

Task ID: #

Blame Rev:
2013-08-22 15:27:19 -07:00
Jim Paton
cb703c9d03 Allow WriteBatch::Handler to abort iteration
Summary:
Sometimes you don't need to iterate through the whole WriteBatch. This diff makes the Handler member functions return a bool that indicates whether to abort or not. If they return true, the iteration stops.

One thing I just thought of is that this will break backwards-compability. Maybe it would be better to add a virtual member function WriteBatch::Handler::ShouldAbort() that returns false by default. Comments requested.

I still have to add a new unit test for the abort code, but let's finalize the API first.

Test Plan: make -j32 check

Reviewers: dhruba, haobo, vamsi, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12339
2013-08-21 18:27:48 -07:00
Mayank Agarwal
404d63ac3e Add TODO for DBToStackableDB function which doesn't work yet
Summary:

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-08-20 21:33:53 -07:00
Deon Nicholas
b87dcae1a3 Made merge_oprator a shared_ptr; and added TTL unit tests
Test Plan:
- make all check;
- make release;
- make stringappend_test; ./stringappend_test

Reviewers: haobo, emayanke

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D12381
2013-08-20 13:35:28 -07:00
Mayank Agarwal
3ab2792f93 Add default info to comment in leveldb/options.h for no_block_cache
Summary: to ley clients know

Test Plan: visual
2013-08-19 14:29:40 -07:00
Mayank Agarwal
28e6fe5e9f Correct documentation for no_block_cache in leveldb/options.h
Summary: false shoudl have been true in comment

Test Plan: visual
2013-08-19 14:27:19 -07:00
Mayank Agarwal
8a3547d38e API for getting archived log files
Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly

Test Plan: make all check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12087
2013-08-19 13:37:04 -07:00
Deon Nicholas
e1346968d8 Merge operator fixes part 1.
Summary:
-Added null checks and revisions to DBIter::MergeValuesNewToOld()
-Added DBIter test to stringappend_test
-Major fix with Merge and TTL
More plans for fixes later.

Test Plan:
-make clean; make stringappend_test -j 32; ./stringappend_test
-make all check;

Reviewers: haobo, emayanke, vamsi, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12315
2013-08-19 11:42:47 -07:00
Mayank Agarwal
387ac0f1e1 Expose statistic for sequence number and implement setTickerCount
Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount

Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12327
2013-08-15 23:00:20 -07:00
Jim Paton
0307c5fe3a Implement log blobs
Summary:
This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch and the ordering of blobs, puts, merges, and deletes are preserved.

Blobs are not saves to SST files. RocksDB ignores blobs in every way except for writing them to the log.

Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API.

Test Plan: make -j32 check

Reviewers: dhruba, haobo, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12195
2013-08-14 16:32:46 -07:00
Haobo Xu
d9dd2a1926 [RocksDB] Expose thread local perf counter for low overhead, per call level performance statistics.
Summary:
As title. No locking/atomic is needed due to thread local. There is also no need to modify the existing client interface, in order to expose related counters.

perf_context_test shows a simple example of retrieving the number of user key comparison done for each put and get call. More counters could be added later.

Sample output
./perf_context_test 1000000
==== Test PerfContextTest.KeyComparisonCount
Inserting 1000000 key/value pairs
...
total user key comparison get: 43446523
total user key comparison put: 8017877
max user key comparison get: 88939
avg user key comparison get:43

Basically, the current skiplist does well on average, but could perform poorly in extreme cases.

Test Plan: run perf_context_test <total number of entries to put/get>

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D12225
2013-08-14 15:24:06 -07:00
Mayank Agarwal
f1bf169484 Counter for merge failure
Summary:
With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail.
Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc

I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me.

Test Plan: make all check

Reviewers: dnicholas, haobo, dhruba, vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12129
2013-08-13 14:25:42 -07:00
Tyler Harter
f5f1842282 Prefix filters for scans (v4)
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3.  Also, make the CreateFilter code faster and cleaner.

Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test

Reviewers: dhruba

Reviewed By: dhruba

CC: haobo, emayanke

Differential Revision: https://reviews.facebook.net/D12027
2013-08-13 14:04:56 -07:00
sumeet
3b81df34bd Separate compaction filter for each compaction
Summary:
If we have same compaction filter for each compaction,
application cannot know about the different compaction processes.
Later on, we can put in more details in compaction filter for the
application to consume and use it according to its needs. For e.g. In
the universal compaction, we have a compaction process involving all the
files while others don't involve all the files. Applications may want to
collect some stats only when during full compaction.

Test Plan: run existing unit tests

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: xinyaohu, leveldb

Differential Revision: https://reviews.facebook.net/D12057
2013-08-13 10:56:20 -07:00
Dhruba Borthakur
f5fa26b6a9 Merge branch 'performance' of github.com:facebook/rocksdb into performance
Conflicts:
	db/builder.cc
	db/db_impl.cc
	db/version_set.cc
	include/leveldb/statistics.h
2013-08-07 11:58:06 -07:00
Deon Nicholas
c2d7826ced [RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences.
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.

I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.

This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.

NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!

Test Plan:
  -This is a preliminary diff. I have only just begun testing/debugging it.
  -I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
  -I will be "desk-checking" and walking through the code with the help gdb.
  -I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
  -I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
  -A lot of feedback from the reviewers.

Reviewers: haobo, dhruba, zshao, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11499
2013-08-05 20:14:32 -07:00
Jim Paton
8e792e5896 Add soft_rate_limit stats
Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit.

Test Plan: make -j32 check

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12021
2013-08-05 18:45:23 -07:00
Mayank Agarwal
1d7b4765c3 Expose base db object from ttl wrapper
Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is'

Test Plan: make

Reviewers: dhruba, vamsi, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11919
2013-08-05 18:44:14 -07:00
Jim Paton
1036537c94 Add soft and hard rate limit support
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:

1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.

Test Plan:
make -j32 check
./db_stress

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12003
2013-08-05 15:43:49 -07:00
Mayank Agarwal
cacd812fb2 Support user's compaction filter in TTL logic
Summary: TTL uses compaction filter to purge key-values and required the user to not pass one. This diff makes it accommodating of user's compaciton filter. Added test to ttl_test

Test Plan: make; ./ttl_test

Reviewers: dhruba, haobo, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11973
2013-08-05 11:28:01 -07:00
Dhruba Borthakur
711a30cb30 Merge branch 'master' into performance
Conflicts:
	include/leveldb/options.h
	include/leveldb/statistics.h
	util/options.cc
2013-08-02 10:22:08 -07:00
Mayank Agarwal
59d0b02f8b Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache
Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test

Test Plan: make all check;db_stress for 1 hour

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11853
2013-08-01 09:07:46 -07:00
Jim Paton
9700677a2b Slow down writes gradually rather than suddenly
Summary:
Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger).

This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files.

For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384.

Test Plan:
make -j32 check
./db_stress
./db_bench

Reviewers: dhruba, haobo, MarkCallaghan, xjin

Reviewed By: xjin

CC: leveldb, xjin, zshao

Differential Revision: https://reviews.facebook.net/D11859
2013-07-31 16:20:48 -07:00
Xing Jin
0f0a24e298 Make arena block size configurable
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.

I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.

Test Plan:
new unit test, I am running db_test.

For passing paramter from configured option to Arena, I tried tests like:

  TEST(DBTest, Arena_Option) {
  std::string dbname = test::TmpDir() + "/db_arena_option_test";
  DestroyDB(dbname, Options());

  DB* db = nullptr;
  Options opts;
  opts.create_if_missing = true;
  opts.arena_block_size = 1000000; // tested 99, 999999
  Status s = DB::Open(opts, dbname, &db);
  db->Put(WriteOptions(), "a", "123");
  }

and printed some debug info. The results look good. Any suggestion for such a unit-test?

Reviewers: haobo, dhruba, emayanke, jpaton

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D11799
2013-07-31 12:42:23 -07:00
Jim Paton
18afff2e63 Add stall counts to statistics
Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur.

Test Plan:
make -j32 check
./db_stress

# Not really sure what else should be done...

Reviewers: dhruba, MarkCallaghan, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11841
2013-07-29 10:34:23 -07:00
Jim Paton
52d7ecfc78 Virtualize SkipList Interface
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.

Test Plan:
make clean
make -j32 check
./db_stress

Reviewers: dhruba, emayanke, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11739
2013-07-23 14:42:27 -07:00
Mayank Agarwal
bf66c10b13 Use KeyMayExist for WriteBatch-Deletes
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.

Test Plan: db_stress;make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D11745
2013-07-23 13:36:50 -07:00
Haobo Xu
9ee68871dc [RocksDB] Enable manual compaction to move files back to an appropriate level.
Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility.

Test Plan: make check;

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11553
2013-07-19 16:20:36 -07:00
Dhruba Borthakur
4a745a5666 Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	include/leveldb/options.h
	util/options.cc
2013-07-17 15:05:57 -07:00
Mayank Agarwal
2a986919d6 Make rocksdb-deletes faster using bloom filter
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time

Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved

Reviewers: dhruba, haobo, vamsi

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11607
2013-07-11 12:11:11 -07:00
Dhruba Borthakur
7cb8d462d5 Rename PickCompactionHybrid to PickCompactionUniversal.
Summary:
Rename PickCompactionHybrid to PickCompactionUniversal.
Changed a few LOG message from "Hybrid:" to "Universal:".

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-07-09 16:08:54 -07:00
Haobo Xu
a8d5f8dde2 [RocksDB] Remove old readahead options
Summary: As title.

Test Plan: make check; db_bench

Reviewers: dhruba, MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11643
2013-07-09 11:22:33 -07:00
Dhruba Borthakur
116ec527f2 Renamed 'hybrid_compaction' tp be "Universal Compaction'.
Summary:
All the universal compaction parameters are encapsulated in
a new file universal_compaction.h

Test Plan:
make check
2013-07-03 15:47:53 -07:00
Mayank Agarwal
d56523c49c Update rocksdb version
Summary: rocksdb-2.0 released to third party

Test Plan: visual inspection

Reviewers: dhruba, haobo, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11559
2013-07-01 14:30:04 -07:00
Dhruba Borthakur
47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
Haobo Xu
71e0f695c1 [RocksDB] Expose count for WriteBatch
Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check.

Test Plan: make check;

Reviewers: emayanke, sheki, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11523
2013-06-26 15:13:21 -07:00
Haobo Xu
4deaa0d48b [RocksDB] Minor change to statistics.h
Summary: as title, use initialize list so that lines fit in 80 chars.

Test Plan: make check;

Reviewers: sheki, dhruba

Differential Revision: https://reviews.facebook.net/D11385
2013-06-19 12:44:42 -07:00
Haobo Xu
96be2c4ee0 [RocksDB] Add mmap_read option for db_stress
Summary: as title, also removed an incorrect assertion

Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11367
2013-06-19 10:28:32 -07:00
Abhishek Kona
5ef6bb8c37 [rocksdb][refactor] statistic printing code to one place
Summary: $title

Test Plan: db_bench --statistics=1

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11373
2013-06-18 20:28:41 -07:00
Haobo Xu
3cc1af2062 [RocksDB] Option for incremental sync
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11295
2013-06-18 15:00:32 -07:00
Dhruba Borthakur
6acbe0fc45 Compact multiple memtables before flushing to storage.
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.

There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11241
2013-06-18 14:28:04 -07:00
Abhishek Kona
f561b3a324 [Rocksdb] Rename one stat key from leveldb to rocksdb 2013-06-17 14:33:05 -07:00
Abhishek Kona
39ee47fbf4 [Rocksdb] Record WriteBlock Times into a histogram
Summary: Add a histogram to track WriteBlock times

Test Plan: db_bench and print

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11319
2013-06-17 10:11:10 -07:00
Abhishek Kona
7a5f71d19a [Rocksdb] measure table open io in a histogram
Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times.

Test Plan: db_bench --statistics=1

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D11217
2013-06-13 17:25:09 -07:00
Haobo Xu
778e179046 [RocksDB] Sync file to disk incrementally
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.

Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write only test 800bytes test:
Before the change:  await  occillate between 10ms and 3ms
After the change: await ranges 1-3ms

Will test against read-modify-write workload too, see if high read latency P99 could be resolved.

Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11115
2013-06-12 12:53:59 -07:00