Commit Graph

426 Commits

Author SHA1 Message Date
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Igor Canadi
beeb74be6f Move I/O outside of lock
Summary:
I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this.

It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time).

Test Plan: make check (currently running)

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13839
2013-11-01 12:32:27 -07:00
Haobo Xu
8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Naman Gupta
b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Naman Gupta
fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Siying Dong
068a819ac9 Fix valgrind_check problem of simple_table_db_test.cc
Summary: Two memory issues caused valgrind_check to fail on simple_table_db_test. Fix it

Test Plan: make -j32 valgrind_check

Reviewers: kailiu, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13773
2013-10-29 14:29:03 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Kai Liu
7e91b86f4d Fix a valgrind warning
Summary:
A latest valgrind test found a recently added unit test has memory leak,
which is because DB is not closed at the end of the test.

Test Plan: re-run the valgrind locally and make sure there's no memory leakage any more.

Reviewers: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13725
2013-10-28 14:34:27 -07:00
Igor Canadi
100fa8e013 If a Put fails, fail all other puts
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.

I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?

Test Plan: Corruption unit test.

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13671
2013-10-28 12:36:02 -07:00
Kai Liu
a1d38a41fd fix the error message in debug mode
Summary:

my fix patch introduced a new error in debug mode.

Test Plan:

`make` and `make release`
2013-10-27 23:11:13 -07:00
Kai Liu
39c14891b6 Fix the gcc warning for unused variable
Summary: Fix the unused variable warning for `first` when running `make release`

Test Plan:
make
make check

Reviewers: dhruba, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13695
2013-10-27 22:57:45 -07:00
Mayank Agarwal
56305221c4 Unify DeleteFile and DeleteWalFiles
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff

Test Plan: Expanded delete_file test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13647
2013-10-25 08:32:14 -07:00
Kai Liu
c17607a251 Fix the log number bug when updating MANIFEST file
Summary:
Crash may occur during the flushes of more than two mem tables.

As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.

This diff fix the problem by setting the correct "log number" in MANIFEST.

Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13659
2013-10-24 21:05:33 -07:00
Slobodan Predolac
e44976b199 Conversion of db_bench, db_stress and db_repl_stress to use gflags
Summary: Converted db_stress, db_repl_stress and db_bench to use gflags

Test Plan: I tested by printing out all the flags from old and new versions. Tried defaults, + various combinations with "interesting flags". Also, tested by running db_crashtest.py and db_crashtest2.py.

Reviewers: emayanke, dhruba, haobo, kailiu, sdong

Reviewed By: emayanke

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D13581
2013-10-24 07:43:14 -07:00
Haobo Xu
2fb361ad98 [RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log.
Summary: as title

Test Plan: make check; ./perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13629
2013-10-23 13:38:39 -07:00
Mayank Agarwal
e56ce03691 Hardcoding temp file name for Identity file to 000000.dbtmp just like it's done for CURRENT file
Summary: as per Dhruba's suggestion

Test Plan: make all check; Seen the Id getting generated properly in db_repl_stress

Reviewers: dhruba, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13635
2013-10-23 11:34:22 -07:00
Mayank Agarwal
9b50106f9a Dbid feature
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.

Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13587
2013-10-22 12:23:34 -07:00
Mayank Agarwal
ae8e0770b4 Disallow transaction log iterator to skip sequences
Summary:
This is expected to solve the "gaps in transaction log iterator" problem.
* After a lot of observations on the gaps on the sigmafio machines I found that it is due to a race between log reader and writer always.
* So when we drop the wormhole subscription and refresh the iterator, the gaps are not there.
* It is NOT due to some boundary or corner case left unattended in the iterator logic because I checked many instances of the gaps against their log files with ldb. The log files are NOT corrupted also.
* The solution is to not allow the iterator to read incompletely written sequences and detect gaps inside itself and invalidate it which will cause the application to refresh the iterator normally and seek to the required sequence properly.
* Thus, the iterator can at least guarantee that it will not give any gaps.

Test Plan:
* db_test based log iterator tests
* db_repl_stress
* testing on sigmafio setup to see gaps go away

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13593
2013-10-22 11:45:35 -07:00
Siying Dong
65428b0c0a Fix Bug: iterator.Prev() or iterator.SeekToLast() might return the first element instead of the correct one
Summary:
Recent patch https://reviews.facebook.net/D11865 introduced a regression bug:

DBIter::FindPrevUserEntry(), which is called by DBIter::Prev() (and also implicitly if calling iterator.SeekToLast())  might do issue a seek when having skipped too many entries. If the skipped entry just before the seek() is a delete, the saved key is erased so that it seeks to the front, so Prev() would return the first element.

This patch fixes the bug by not doing seek() in DBIter::FindNextUserEntry() if saved key has been erased.

Test Plan: Add a test DBTest.IterPrevMaxSkip which would fail without the patch and would pass with the change.

Reviewers: dhruba, xjin, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13557
2013-10-17 18:33:18 -07:00
Siying Dong
9edda37027 Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress
Summary:
This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases.

Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later.

Test Plan:
add test
DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13467
2013-10-17 13:33:39 -07:00
Dhruba Borthakur
9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Siying Dong
073cbfc8f0 Enable background flush thread by default and fix issues related to it
Summary:
Enable background flush thread in this patch and fix unit tests with:
(1) After background flush, schedule a background compaction if condition satisfied;
(2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered
(3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results.

Test Plan: pass all unit tests

Reviewers: haobo, xjin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13461
2013-10-16 13:32:53 -07:00
Mayank Agarwal
da2fd001a6 Fix rocksdb->levledb BytewiseComparator and inverted order of error in db/version_set.cc
Summary:
This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb.
The inverted order in the error message caused confusion prebiously

Test Plan: make; open existing db

Reviewers: leveldb, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13449
2013-10-14 18:16:54 -07:00
Mayank Agarwal
fe3713961e Features in Transaction log iterator
Summary:
* Logstore requests a valid change of reutrning an empty iterator and not an error in case of no log files.
* Changed the code to return the writebatch containing the sequence number requested from GetupdatesSince even if it lies in the middle. Earlier we used to return the next writebatch,. This also allows me oto guarantee that no files played upon by the iterator are redundant. I mean the starting log file has at least a sequence number >= the sequence number requested form GetupdatesSince.
* Cleaned up redundant logic in Iterator::Next and made a new function SeekToStartSequence for greater readability and maintainibilty.
* Modified a test in db_test accordingly
Please check the logic carefully and suggest improvements. I have a separate patch out for more improvements like restricting reader to read till written sequences.

Test Plan:
* transaction log iterator tests in db_test,
* db_repl_stress.
* rocks_log_iterator_test in fbcode/wormhole/rocksdb/test - 2 tests thriving on hacks till now can get simplified
* testing on the shadow setup for sigma with replication

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13437
2013-10-14 18:16:21 -07:00
Kai Liu
86ef6c3f74 Add statistics to sst file
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file.  It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.

This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.

After this diff, we will figure out the interface of how to allow user to collect their interested statistics.

Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13419
2013-10-14 15:56:13 -07:00
Siying Dong
88f2f89068 Change Function names from Compaction->Flush When they really mean Flush
Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance.

Test Plan: Run test suites.

Reviewers: haobo, dhruba, xjin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13431
2013-10-14 15:12:15 -07:00
sdong
f8509653ba LRUCache to try to clean entries not referenced first.
Summary:
With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.

Test Plan: add two unit tests to cover the codes

Reviewers: dhruba, haobo, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke, xjin

Differential Revision: https://reviews.facebook.net/D13377
2013-10-11 09:26:21 -07:00
Dhruba Borthakur
c0ce562c32 Bad nfs file checked in a long time back.
Summary:
Bad nfs file checked in a long time back.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-10 22:00:49 -07:00
Mayank Agarwal
a8b4a69de0 Fixing error in ParseFileName causing DestroyDB to fail on archive directory
Summary:
This careless error was causing ASSERT_OK(DestroyDB) to fail in db_test.
Basically .. was being returned as a child of db/archive and ParseFileName returned false on that,
but 'type' was set to LogFile from earlier and not reset. The return of ParseFileName was not being checked to delete the log file or not.

Test Plan: make all check

Reviewers: dhruba, haobo, xjin, kailiu, nkg-

Reviewed By: nkg-

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13413
2013-10-10 18:18:31 -07:00
Naman Gupta
cbf4a06427 Add option for storing transaction logs in a separate dir
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.

Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.

Reviewers: dhruba, haobo, leveldb

CC: xinyaohu, sumeet

Differential Revision: https://reviews.facebook.net/D13239
2013-10-08 17:40:27 -07:00
Naman Gupta
116071411b Make db_test more robust
Summary: While working on D13239, I noticed that the same options are not used for opening and destroying at db. So adding that. Also added asserts for successful DestroyDB calls.

Test Plan: Ran unit tests. Atleast 1 unit test is failing. They failures are a result of some past logic change. I'm not really planning to fix those. But I would like to check this in. And hopefully the respective unit test owners can fix the broken tests

Reviewers: leveldb, haobo

CC: xinyaohu, sumeet, dhruba

Differential Revision: https://reviews.facebook.net/D13329
2013-10-08 13:19:31 -07:00
Dhruba Borthakur
1a8c1b0817 Unit test failure in DBTest.NumImmutableMemTable.
Summary:
Previous patch introduced a unit test failure in
DBTest.NumImmutableMemTable because of change in property names.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-06 01:12:02 -07:00
Dhruba Borthakur
4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Haobo Xu
bf89edf78b [RocksDB] Added a property "leveldb.num-immutable-mem-table" so that Flush can be called without blocking, and application still has a way to check when it's done also without blocking.
Summary: as title

Test Plan: DBTest.NumImmutableMemTable

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13305
2013-10-05 11:54:08 -07:00
Dhruba Borthakur
0a9f873f4b Removed scribe, thrift and java modules.
Summary: Removed scribe, thrift and java modules.

Test Plan:
make release
make check

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13293
2013-10-04 15:36:00 -07:00
Dhruba Borthakur
a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Mayank Agarwal
b3ed08129b Add a statistic to count the number of calls to GetUpdatesSince
Summary: This is useful to keep track of refreshes in transaction log iterator

Test Plan: make; db_stress --statistics=1 shows it

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13281
2013-10-04 10:47:20 -07:00
Mayank Agarwal
854d236361 Add backward compatible option in GetLiveFiles to choose whether to not Flush first
Summary:
As explained in comments in GetLiveFiles in db.h, this option will cause flush to be skipped in GetLiveFiles because some use-cases use GetSortedWalFiles after GetLiveFiles to generate more complete snapshots.
Using GetSortedWalFiles after GetLiveFiles allows us to not Flush in GetLiveFiles first because wals have everything.
Note: file deletions will be disabled before calling GLF or GSWF so live logs will not move to archive logs or get delted.
Note: Manifest file is truncated to a proper value in GLF, so it will always reply from the proper wal files on a restart

Test Plan: make

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13257
2013-10-04 10:20:10 -07:00
Haobo Xu
200c05a23f [RocksDB] Still honor DisableFileDeletions when purge_log_after_memtable_flush is on
Summary: as title

Test Plan: make check

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13263
2013-10-03 16:12:43 -07:00
Haobo Xu
fa798e9e28 [Rocksdb] Submit mem table flush job in a different thread pool
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .

Test Plan: Try it in shadow test.

Reviewers: dhruba, xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12933
2013-10-03 14:37:19 -07:00
Xing Jin
658a3ce2fa Fix SIGSEGV issue in universal compaction
Summary:
We saw SIGSEGV when set options.num_levels=1 in universal compaction
style. Dug into this issue for a while, and finally found the root cause (thank Haobo for discussion).

Test Plan: Add new unit test. It throws SIGSEGV without this change. Also run "make all check".

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13251
2013-10-02 17:33:31 -07:00
Haobo Xu
71046971f0 [RocksDB] Added perf counters to track skipped internal keys during iteration
Summary: as title. unit test not polished. this is for a quick live test

Test Plan: live

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13221
2013-10-02 10:48:41 -07:00
Xing Jin
8eb552bf4d New unit test for iterator with snapshot
Summary:
I played with the reported bug about iterator with snapshot:
https://code.google.com/p/leveldb/issues/detail?id=200.

I turned the original test program
(https://code.google.com/p/leveldb/issues/attachmentText?id=200&aid=2000000000&name=test.cc&token=7uOUQW-HFlbAFMUm7EqtaAEy7Tw%3A1378320724136)
into a new unit test, but I cannot reproduce the problem. Notice lines
31-34 in above link. I have ran the new test with and without such Put()
operations. Both succeed.

So this diff simply adds the test, without changing any source codes.

Test Plan: run new test.

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12735
2013-09-28 11:39:08 -07:00
Haobo Xu
0c4040681a [RocksDB] Move last_sequence and last_flushed_sequence_ update back into lock protected area
Summary: A previous diff moved these outside of lock protected area. Moved back in now. Also moved tmp_batch_ update outside of lock protected area, as only the single write thread can access it.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13137
2013-09-26 20:43:11 -07:00
Haobo Xu
08740b15a4 [RocksDB] Fix skiplist sequential insertion optimization
Summary: The original optimization missed updating links other than the lowest level.

Test Plan: make check; perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, adsharma

Differential Revision: https://reviews.facebook.net/D13119
2013-09-26 15:17:03 -07:00
Haobo Xu
e0aa19a94e [RocbsDB] Add an option to enable set based memtable for perf_context_test
Summary:
as title.
Some result:

-- Sequential insertion of 1M key/value with stock skip list (all in on memtable)
time ./perf_context_test  --total_keys=1000000  --use_set_based_memetable=0
Inserting 1000000 key/value pairs
...
Put uesr key comparison:
Count: 1000000  Average: 8.0179  StdDev: 176.34
Min: 0.0000  Median: 2.5555  Max: 88933.0000
Percentiles: P50: 2.56 P75: 2.83 P99: 58.21 P99.9: 133.62 P99.99: 987.50
Get uesr key comparison:
Count: 1000000  Average: 43.4465  StdDev: 379.03
Min: 2.0000  Median: 36.0195  Max: 88939.0000
Percentiles: P50: 36.02 P75: 43.66 P99: 112.98 P99.9: 824.84 P99.99: 7615.38
real	0m21.345s
user	0m14.723s
sys	0m5.677s

-- Sequential insertion of 1M key/value with set based memtable (all in on memtable)
time ./perf_context_test  --total_keys=1000000  --use_set_based_memetable=1
Inserting 1000000 key/value pairs
...
Put uesr key comparison:
Count: 1000000  Average: 61.5022  StdDev: 6.49
Min: 0.0000  Median: 62.4295  Max: 71.0000
Percentiles: P50: 62.43 P75: 66.61 P99: 71.00 P99.9: 71.00 P99.99: 71.00
Get uesr key comparison:
Count: 1000000  Average: 29.3810  StdDev: 3.20
Min: 1.0000  Median: 29.1801  Max: 34.0000
Percentiles: P50: 29.18 P75: 32.06 P99: 34.00 P99.9: 34.00 P99.99: 34.00
real	0m28.875s
user	0m21.699s
sys	0m5.749s

Worst case comparison for a Put is 88933 (skiplist) vs 71 (set based memetable)

Of course, there's other in-efficiency in set based memtable implementation, which lead to the overall worst performance. However, P99 behavior advantage is very very obvious.

Test Plan: ./perf_context_test and viewstate shadow testing

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13095
2013-09-25 22:49:18 -07:00
Dhruba Borthakur
f1a60e5c3e The vector rep implementation was segfaulting because of incorrect initialization of vector.
Summary:
The constructor for Vector memtable has a parameter called 'count'
that specifies the capacity of the vector to be reserved at allocation
time. It was incorrectly used to initialize the size of the vector.

Test Plan: Enhanced db_test.

Reviewers: haobo, xjin, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13083
2013-09-25 11:33:52 -07:00
Dhruba Borthakur
5e9f3a9aa7 Better locking in vectorrep that increases throughput to match speed of storage.
Summary:
There is a use-case where we want to insert data into rocksdb as
fast as possible. Vector rep is used for this purpose.

The background flush thread needs to flush the vectorrep to
storage. It acquires the dblock then sorts the vector, releases
the dblock and then writes the sorted vector to storage. This is
suboptimal because the lock is held during the sort, which
prevents new writes for occuring.

This patch moves the sorting of the vector rep to outside the
db mutex. Performance is now as fastas the underlying storage
system. If you are doing buffered writes to rocksdb files, then
you can observe throughput upwards of 200 MB/sec writes.

This is an early draft and not yet ready to be reviewed.

Test Plan:
make check

Task ID: #

Blame Rev:

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D12987
2013-09-19 21:48:10 -07:00