Commit Graph

1255 Commits

Author SHA1 Message Date
Haobo Xu
48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
Igor Canadi
d1d19f5db3 Fix valgrind error in c_test 2014-04-02 17:24:30 -07:00
sdong
158845ba9a Move a info logging out of DB Mutex
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17427
2014-04-02 16:48:32 -07:00
sdong
4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
sdong
284c365b77 Fix valgrind error caused by FileMetaData as two level iterator's index block handle
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead

Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17409
2014-04-02 14:38:28 -07:00
Igor Canadi
8555ce2dec Merge branch 'master' into columnfamilies 2014-04-02 10:48:05 -07:00
sdong
e0a87c4cf1 DBIter to use static allocated char array for saved_key_ (if it is not too long)
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: dhruba, igor, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17289
2014-04-01 16:43:11 -07:00
sdong
d50619a559 PlainTableIterator::Seek() shouldn't check bloom filter in total order mode
Summary:
In total order mode, iterator's seek() shouldn't check total order.

Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.

This bug was reported by @igor.

Test Plan: test plain_table_db_test

Reviewers: ljin, haobo, igor

Reviewed By: igor

CC: yhchiang, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17391
2014-04-01 15:05:16 -07:00
Thomas Adam
38dc5ef45f [C-API] added the possiblity to create a HashSkipList or HashLinkedList to support prefix seeks 2014-04-01 12:44:27 +02:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Igor Canadi
577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Igor Canadi
8a139a054c More valgrind issues!
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?

Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: dhruba

CC: leveldb, danguo

Differential Revision: https://reviews.facebook.net/D17337
2014-03-29 10:34:47 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
sdong
2d3468c293 MemTableIterator not to reference Memtable
Summary: In one of the perf, I shows 10%-25% CPU costs of MemTableIterator.Seek(), when doing dynamic prefix seek, are spent on checking whether we need to do bloom filter check or finding out the prefix extractor. Seems that  more level of pointer checking makes CPU cache miss more likely. This patch makes things slightly simpler by copying pointer of bloom of prefix extractor into the iterator.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17247
2014-03-28 16:46:25 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Yueh-Hsuan Chiang
10cebec79e Fix the bug in MergeUtil which causes mixing values of different keys.
Summary: Fix the bug in MergeUtil which causes mixing values of different keys.

Test Plan:
stringappend_test
make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17235
2014-03-27 16:15:25 -07:00
Haobo Xu
a92194e5b2 [RocksDB] Add db property "rocksdb.cur-size-active-mem-table"
Summary: as title

Test Plan: db_test

Reviewers: sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17217
2014-03-27 15:14:04 -07:00
Igor Canadi
1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
sdong
d556200264 Some small cleaning up to make some compiling environment happy
Summary: Compiler complains some errors when building using our internal build settings. Fix them.

Test Plan: rebuild

Reviewers: haobo, dhruba, igor, yhchiang, ljin

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17199
2014-03-26 18:11:41 -07:00
Igor Canadi
6a08bc042a Fix no return warning in FileComparator 2014-03-26 14:46:07 -07:00
Igor Canadi
1e9621d4e5 Sort files correctly in Builder::SaveTo
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).

In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.

I also added a verification that all files are sorted correctly in CheckConsistency()

NOTE: This diff depends on D17037

Test Plan: make check. Will also run db_stress

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17049
2014-03-26 13:30:14 -07:00
Igor Canadi
954679bb0f AssertHeld() should do things
Summary:
AssertHeld() was a no-op before. Now it does things.

Also, this change caught a bad bug in SuperVersion::Init(). The method is calling db->mutex.AssertHeld(), but db variable is not initialized yet! I also fixed that issue.

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17193
2014-03-26 11:24:52 -07:00
Igor Canadi
ad9a39c9b4 [RocksDB] Preallocate new MANIFEST files
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17163
2014-03-26 09:37:53 -07:00
sdong
6b2e7a2a01 When Options.max_num_files=-1, non level0 files also by pass table cache
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.

Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.

Reviewers: haobo, igor, dhruba, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17001
2014-03-25 18:40:52 -07:00
Igor Canadi
e86d7dffd7 Merge branch 'master' into columnfamilies 2014-03-25 15:24:02 -07:00
Yueh-Hsuan Chiang
b9ce156e38 Add assert to MergeOperator::PartialMergeMulti to check # of operands.
Summary:
Add assert(operands_list.size() >= 2) in MergeOperator::PartialMergeMulti
to ensure it's only be called when we have at least two merge operands.

Test Plan: run merge_test and stringappend_test.

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17169
2014-03-25 13:39:17 -07:00
Danny Guo
d9ca83df28 [rocksdb] make init prefix more robust
Summary:
Currently if client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff added a bool to indicate if the prefix
has been intialized. I also added a unit test to cover this case and
make sure the new code path is hit.

Test Plan: db_test

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17151
2014-03-25 11:59:40 -07:00
Yueh-Hsuan Chiang
34f9da1cef Fix the failure of stringappend_test caused by PartialMergeMulti.
Summary:
Fix a bug that PartialMergeMulti will try to merge the first operand
with an empty slice.

Test Plan: run stringappend_test and merge_test.

Reviewers: haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17157
2014-03-25 11:50:09 -07:00
Igor Canadi
e8168382c4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-25 11:09:40 -07:00
Danny Guo
b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
ac328a86b9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
2014-03-20 14:41:37 -07:00
Igor Canadi
c21ce14fa5 Fix double-free in corruption_test 2014-03-20 14:37:30 -07:00
Igor Canadi
e67241f0b9 Sanity check on Open
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Yiting Li
7981a43274 Consistency Check Function
Summary: Added a function/command to check the consistency of live files' meta data

Test Plan:
Manual test (size mismatch, file not exist).
Command test script.

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16935
2014-03-20 13:42:45 -07:00
Igor Canadi
8ea3cb621e If paranoid_checks -- Mark DB read-only on any IOError
Summary:
Whenever we get an IOError from GetImpl() or NewIterator(), we should immediatelly mark the DB read-only. The same check already exists in Write() and Compaction().

This should help with clients that are somehow missing a file.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17061
2014-03-20 13:10:02 -07:00
sdong
f681030c80 Fix DBTest.UniversalCompactionTrigger failure caused by D17067
Summary: D17067 breaks DBTest.UniversalCompactionTrigger because of wrong location of the checking. Fix it.

Test Plan: Run the test and make sure it passes.

Reviewers: igor, haobo

Reviewed By: igor

CC: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17079
2014-03-20 11:10:11 -07:00
sdong
752ec46cd5 Add a unit test to verify compaction filter context
Summary: Add unit tests to make sure CompactionFilterContext::is_manual_compaction_ and CompactionFilterContext::is_full_compaction_ are set correctly.

Test Plan: run the new tests.

Reviewers: haobo, igor, dhruba, yhchiang, ljin

Reviewed By: haobo

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17067
2014-03-19 18:10:48 -07:00
Igor Canadi
e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
Igor Canadi
fcd5c5e828 ComputeCompactionScore in CompactionPicker
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.

The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console

The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...

"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.

I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.

Test Plan: make check for now, will run db_stress in jenkins

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17037
2014-03-19 16:52:26 -07:00
Igor Canadi
e493f2f54e Don't compact with zero input files
Summary:
We have an issue with internal service trying to run compaction with zero input files:
2014/02/07-02:26:58.386531 7f79117ec700 Compaction start summary: Base version 1420 Base level 3, seek compaction:0, inputs:[Ï›~^Qy^?],[]
2014/02/07-02:26:58.386539 7f79117ec700 Compacted 0@3 + 0@4 files => 0 bytes

There are two issues:
* inputsummary is printing out junk
* it's constantly retrying (since I guess madeProgress is true), so it prints out a lot of data in the LOG file (40GB in one day).

I read through the Level compaction picker and added some failure condition if input[0] is empty. I think PickCompaction() should not return compaction with zero input files with this change. I'm not confident enough to add an assertion though :)

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16005
2014-03-19 16:01:25 -07:00
Igor Canadi
22507aff6c Fix compile issue in Mac OS
Summary:
Compile issues are:
* Unused variable env_
* Unused fallocate_with_keep_size_

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17043
2014-03-19 15:40:12 -07:00
Lei Jin
6dc940d4c9 avoid shared_ptr assignment in Version::Get()
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!

Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s

Reviewers: haobo, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17007
2014-03-19 10:54:32 -07:00
sdong
71e6a34271 Add a DB property to indicate number of background errors encountered
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring

Test Plan: Add a unit test. make all check

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: ljin, nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16959
2014-03-18 14:28:30 -07:00
Igor Canadi
69aa6ecb26 Finalize fist version in column family 2014-03-18 14:23:47 -07:00
Igor Canadi
e25819a185 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
2014-03-18 14:00:20 -07:00
Kai Liu
1ec72b37b1 Several easy-to-add properties related to compaction and flushes
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.

Test Plan: run unit tests and add a new unit test to cover new properties.

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D13677
2014-03-18 14:00:09 -07:00
Igor Canadi
758fa8c359 Don't Finalize in CompactionPicker
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983
2014-03-18 13:59:59 -07:00
Igor Canadi
3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Lei Jin
63cef90078 disable the log_number check in Recover()
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923
2014-03-18 12:46:29 -07:00
Igor Canadi
bcea9c1296 Finalize version in dumpmanifest 2014-03-18 09:45:52 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi
ae25742af9 Fix race condition in manifest roll
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929
2014-03-17 21:50:15 -07:00
Igor Canadi
d63ae5cb59 Adjust memtable sizes in unit test 2014-03-17 18:37:34 -07:00
Igor Canadi
64904b39a0 Merge branch 'master' into columnfamilies
Conflicts:
	utilities/backupable/backupable_db.cc
2014-03-17 17:57:14 -07:00
Igor Canadi
e0c1211555 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	tools/db_stress.cc
2014-03-17 12:21:05 -07:00
Yueh-Hsuan Chiang
a5fafd4f46 Correct the logic of MemTable::ShouldFlushNow().
Summary:
Memtable will now be forced to flush if the one of the following
conditions is met:
1. Already allocated more than write_buffer_size + 60% arena block size.
   (the overflowing condition)
2. Unable to safely allocate one more arena block without hitting the
   overflowing condition AND the unused allocated memory < 25% arena
   block size.

Test Plan: make all check

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16893
2014-03-17 12:20:11 -07:00
sdong
c61c9830d4 Fix a bug that Prev() can hang.
Summary: Prev() now can hang when there is a key with more than max_skipped number of appearance internally but all of them are newer than the sequence ID to seek. Add unit tests to confirm the bug and fix it.

Test Plan: make all check

Reviewers: igor, haobo

Reviewed By: igor

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16899
2014-03-17 10:00:41 -07:00
Igor Canadi
30447b7251 Merge pull request #99 from caiosba/master
Make it compile on Debian/GCC 4.7
2014-03-17 08:51:35 -07:00
Lei Jin
0cf6c8f7ce fix: use the correct edit when comparing log_number
Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.

Test Plan: still running make whitebox_crash_test

Reviewers: igor, haobo, igor2

Reviewed By: igor2

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16911
2014-03-15 23:30:43 -07:00
Lei Jin
453ec52ca1 journal log_number correctly in MANIFEST
Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.

Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16875
2014-03-14 18:36:47 -07:00
Caio SBA
b9c78d2db6 Make it compile on Debian/GCC 4.7 2014-03-14 22:44:35 +00:00
Igor Canadi
a782bb989e Fix log_number in LogAndApply 2014-03-14 13:45:30 -07:00
Igor Canadi
8b169e949a Merge branch 'master' into columnfamilies 2014-03-14 13:42:36 -07:00
Igor Canadi
928ee23567 Change WriteBatch interface 2014-03-14 13:40:06 -07:00
Igor Canadi
2bad3cb0db Missing includes 2014-03-14 13:02:20 -07:00
Igor Canadi
db234133a9 [CF] WriteBatch to take in ColumnFamilyHandle
Summary: Client doesn't need to know anything about ColumnFamily ID. By making WriteBatch take ColumnFamilyHandle as a parameter, we can eliminate method GetID() from ColumnFamilyHandle

Test Plan: column_family_test

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16887
2014-03-14 11:30:14 -07:00
Igor Canadi
3c75cc15a9 Fix HashSkipList and HashLinkedList SIGSEGV
Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.

Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.

Test Plan: SIGSEGV and assertion-throwing test now works!

Reviewers: ljin, haobo, sdong

Reviewed By: sdong

CC: leveldb, ljin

Differential Revision: https://reviews.facebook.net/D16857
2014-03-14 10:02:04 -07:00
Igor Canadi
6c72079d77 Fix warning on Mac OS 2014-03-14 09:54:23 -07:00
Igor Canadi
f0e1e3ebf1 CF cleanup part 2 2014-03-13 14:34:01 -07:00
Igor Canadi
f071a20f6e Need more data in memtable to flush due to 11da8b 2014-03-13 13:52:20 -07:00
Igor Canadi
e1f56e12cf Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	tools/db_stress.cc
2014-03-13 13:21:20 -07:00
sdong
5aa81f04fa Fix extra compaction tasks scheduled after D16767 in some cases
Summary:
With D16767, there is a case compaction tasks are scheduled infinitely:
(1) no flush thread is configured and more than 1 compaction threads
(2) a flush is going on by one compaction hread
(3) the state of SST files is in the state that versions_->current()->NeedsCompaction() will generate a false positive (return true actually there is no work to be done)
In that case, a infinite loop will be formed.

This patch would fix it.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16863
2014-03-13 13:06:08 -07:00
Kai Liu
11da8bc5df A heuristic way to check if a memtable is full
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.

Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051
2014-03-12 16:40:14 -07:00
Igor Canadi
25c8a1a20f More bug fixed introduced by code cleanup 2014-03-12 12:28:23 -07:00
Igor Canadi
b5d6ad69fc Bug fixes introduced by code cleanup 2014-03-12 11:10:26 -07:00
Igor Canadi
dff9214165 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	tools/db_stress.cc
2014-03-12 10:17:41 -07:00
Igor Canadi
fb2346fc1f [CF] Code cleanup part 1
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.

Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.

This diff also fixes race condition when dropping column family.

Test Plan: Ran db_stress with variety of parameters

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D16833
2014-03-12 09:56:53 -07:00
Igor Canadi
45ad75db80 Correct version of D16821 2014-03-12 09:38:59 -07:00
Igor Canadi
2b95dc1542 Revert "Fix bad merge of D16791 and D16767"
This reverts commit 839c8ecfcd.
2014-03-12 09:37:43 -07:00
sdong
839c8ecfcd Fix bad merge of D16791 and D16767
Summary: A bad Auto-Merge caused log buffer is flushed twice. Remove the unintended one.

Test Plan: Should already be tested (the code looks the same as when I ran unit tests).

Reviewers: haobo, igor

Reviewed By: haobo

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16821
2014-03-11 21:31:57 -07:00
sdong
bd45633b71 Fix data race against logging data structure because of LogBuffer
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.

With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.

Thanks @igor for this finding!

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: dhruba, ljin, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D16767
2014-03-11 16:09:53 -07:00
Igor Canadi
d833f15738 Fix bug in VersionEdit::DebugString() 2014-03-11 12:14:09 -07:00
Igor Canadi
37472bb279 Add MaxColumnFamily to VersionEdit::DebugString() 2014-03-11 12:08:42 -07:00
Igor Canadi
457c78eb89 [CF] db_stress for column families
Summary:
I had this diff for a while to test column families implementation. Last night, I ran it sucessfully for 10 hours with the command:

   time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1  --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress

It is ready to be committed :)

Test Plan: Ran it for 10 hours

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16797
2014-03-11 12:06:12 -07:00
sdong
6c66bc08d9 Temp Fix of LogBuffer flushing
Summary: To temp fix the log buffer flushing. Flush the buffer inside the lock. Clean the trunk before we find an eventual fix.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: ljin, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D16791
2014-03-11 11:37:40 -07:00
Igor Canadi
cb9802168f Add a comment after SignalAll()
Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.

Test Plan: no test

Reviewers: sdong, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16785
2014-03-11 11:27:19 -07:00
Igor Canadi
dad8603fc4 [CF] Fix column family dropping
Summary: Column family should be dropped after the change has been commited

Test Plan: db stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16779
2014-03-11 09:47:29 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Igor Canadi
d5de22dc09 Call PurgeObsoleteFiles() only when HaveSomethingToDelete()
Summary: as title

Test Plan: fixed the build failure http://ci-builds.fb.com/job/rocksdb_build/987/console

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16743
2014-03-10 15:42:14 -07:00
sdong
fac58c0504 DBTest: remove perf_context's time > 0 check
Summary: DBTest checks  perf_context.seek_internal_seek_time > 0 and perf_context.find_next_user_entry_time > 0, which is not reliable. Remove them.

Test Plan: ./db_test

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16737
2014-03-10 14:24:56 -07:00
Haobo Xu
a91aed615a [RocksDB] Minor cleanup of PurgeObsoleteFiles
Summary: as title. also made info log output of file deletion a bit more descriptive.

Test Plan: make check; db_bench and look at LOG output

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16731
2014-03-10 14:13:38 -07:00
Lei Jin
8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Haobo Xu
9e0e6aa7f6 [RocksDB] make sure KSVObsolete does not get accessed as a valid pointer.
Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.

Test Plan: make check

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16701
2014-03-10 12:55:25 -07:00
Haobo Xu
66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
Igor Canadi
04d2c26e17 Add option verify_checksums_in_compaction
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695
2014-03-10 10:06:34 -07:00
Igor Canadi
d4f2c610d3 Ignore dropped column families -- don't flush or compact them 2014-03-07 18:43:21 -08:00
Igor Canadi
1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Igor Canadi
9f15092ebd [CF] NewIterators
Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation

Test Plan: added a unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16689
2014-03-07 16:15:25 -08:00
Lei Jin
e5fa4944fc use CAS when returning SuperVersion to ThreadLocal
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641
2014-03-07 14:43:22 -08:00
Igor Canadi
eec8695206 Delete local sv when destroying DB from stress test
Summary: Not deleting local SV caused some an crash test issue: http://ci-builds.fb.com/job/rocksdb_asan_crash_test/83/console

Test Plan: ran unit tests

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16635
2014-03-06 18:15:26 -08:00
Igor Canadi
80a207fc90 Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/compaction_picker.h
	db/db_impl.cc
	db/version_set.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-05 16:59:22 -08:00
sdong
ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
Igor Canadi
e2dd148a8b Fix compile fail introduced by merge 2014-03-05 12:47:44 -08:00
Igor Canadi
a329dd1b25 Fix TEST_Destroy_DBImpl() to work with column families 2014-03-05 12:27:39 -08:00
Igor Canadi
0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
Igor Canadi
9625acbf70 [CF] Dont reuse dropped column family IDs
Summary:
Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest.

Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine.

Test Plan: added a test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16581
2014-03-05 12:13:44 -08:00
Igor Canadi
8ca30bd51b Merge pull request #47 from mlin/kCompactionStopStyleSimilarSize
An initial implementation of kCompactionStopStyleSimilarSize for universal compaction
2014-03-05 10:35:30 -08:00
Lei Jin
04298f8c33 output perf_context in db_bench readrandom
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575
2014-03-05 10:32:54 -08:00
Lei Jin
64138b5d9c fix db_bench to use HashSkipList for real
Summary:
For HashSkipList case, DBImpl has sanity check to see if prefix_extractor in
options is the same as the one in memtable factory. If not, it falls
back to SkipList. As result, I was experimenting with SkipList
performance. No wonder it is much worse than LinkedList

Test Plan: ran benchmark

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16569
2014-03-05 10:28:53 -08:00
Lei Jin
51560ba755 config max_background_flushes in db_bench
Summary: as title

Test Plan: make release

Reviewers: haobo, sdong, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16437
2014-03-05 10:27:17 -08:00
Igor Canadi
c0ccf43648 MergingIterator assertion
Summary: I wrote a test that triggers assertion in MergingIterator. I have not touched that code ever, so I'm looking for somebody with good understanding of the MergingIterator code to fix this. The solution is probably a one-liner. Let me know if you're willing to take a look.

Test Plan: This test fails with an assertion `use_heap_ == false`

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16521
2014-03-05 09:13:07 -08:00
sdong
e8ecca9e86 CleanupIteratorState() only to initialize DeletionState when super version cleanup needed
Summary:
Two changes:
1. DeletionState is only constructed when cleaning up is needed
2. Fix the bug of deletion state construction bug. A change was made in a previous patch: https://reviews.facebook.net/rROCKSDB774ed89c2405ee058086b099cbc8b29e243739cc#71a34e2e However, it somehow got lost when merging

Test Plan: make all check

Reviewers: kailiu, haobo, igor

Reviewed By: igor

CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16233
2014-03-04 20:58:20 -08:00
Igor Canadi
e21d5b8bbc [CF] Flush all memtables on column family drop
Summary: When column family is dropped, we want to delete all WALs that refer to it. To do that, we need to make them obsolete by flushing all the memtables

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16557
2014-03-04 17:21:30 -08:00
Lei Jin
a5b1d2f146 make key evenly distributed between 0 and FLAGS_num
Summary:
The issue is that when FLAGS_num is small, the leading bytes of the key
are padded with 0s. This makes all keys have the same prefix 00000000

Most of the changes are just to make lint happy

Test Plan: ran db_bench

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16317
2014-03-04 17:08:05 -08:00
Igor Canadi
fa34697237 Merge branch 'master' into columnfamilies 2014-03-04 09:39:14 -08:00
Igor Canadi
335b207974 [CF] Delete SuperVersion in a special function
Summary: Added a function DeleteSuperVersion that can be called in DBImpl destructor before PurgingObsoleteFiles. That way, PurgeObsoleteFiles will be able to delete all files held by alive super versions.

Test Plan: column_family_test with valgrind

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16545
2014-03-04 09:35:44 -08:00
kailiu
906f3dca72 Add a hash-index component for block
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245
2014-03-03 21:11:49 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
Igor Canadi
5142b37000 Fix a group commit bug in LogAndApply
Summary:
EncodeTo(&record) does not overwrite, it appends to it.

This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3

I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.

This bug turned up in column family work, where opening a database failed with "adding a same column family twice".

Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16461
2014-03-03 17:10:43 -08:00
Igor Canadi
f9b2f0ad79 [CF] Fix CF bugs in WriteBatch
Summary:
This diff fixes two bugs:
* Increase sequence number even if WriteBatch fails. This is important because WriteBatches in WAL logs have implictly increasing sequence number, even if one update in a write batch fails. This caused some writes to get lost in my CF stress testing
* Tolerate 'invalid column family' errors on recovery. When a column family is dropped, processing WAL logs can have some WriteBatches that still refer to the dropped column family. In recovery environment, we want to ignore those errors. In client's Write() code path, however, we want to return the failure to the client if he's trying to add data to invalid column family.

Test Plan: db_stress's verification works now

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16533
2014-03-03 17:07:46 -08:00
kailiu
bf86af5174 Remove the terrible hack in for flush_block_policy_factory
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503
2014-02-28 16:39:27 -08:00
Igor Canadi
8ea21a778b [CF] Rething LogAndApply for column families
Summary:
I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case.

This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation

(1) and (2) don't support group commit yet.

There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure).

Test Plan: db_stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16491
2014-02-28 14:46:48 -08:00
Igor Canadi
58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Igor Canadi
12966ec1bb Fix LogAndApply() group commit 2014-02-28 12:22:45 -08:00
Yueh-Hsuan Chiang
a77527f2af Add ReadOptions to TransactionLogIterator.
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.

Test Plan: Tests are done off-line and will not be included in the regular unit test.

Reviewers: igor

Reviewed By: igor

CC: leveldb, xjin, dhruba

Differential Revision: https://reviews.facebook.net/D16305
2014-02-28 11:50:36 -08:00
Igor Canadi
f6a257b6a1 Set dropped column family before persisting in the manifest 2014-02-28 11:49:32 -08:00
Igor Canadi
670f3ba212 [CF] Small refactor of Recover() and DumpManifest() 2014-02-28 11:25:38 -08:00
Igor Canadi
099ad94306 Set log number for column family 2014-02-28 11:08:24 -08:00
Igor Canadi
510f84b686 [CF] CreateColumnFamily fix
Summary:
This fixes few bugs with CreateColumnFamily
* We first have to LogAndApply and then call VersionSet::CreateColumnFamily. Otherwise, WriteSnapshot might be invoked, writing out column family add inside of LogAndApply, even though it's not really committed
* Fix LogAndApplyHelper() to not apply log number to column_family_data, which is in case of column family add, just a dummy (default) column family
* Create SuperVerion when creating column family

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16443
2014-02-28 10:40:52 -08:00
Igor Canadi
206b38f31c SetLogNumber in CreateColumnFamily 2014-02-27 16:53:45 -08:00
Igor Canadi
b41a3bc4da [CF] Change flow of CreateColumnFamily
Summary:
Previously, we first wrote to the manifest and then created internal data structure.
Now, we first create internal data structure. That way, we can write out internal comparator to the manifest

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16425
2014-02-27 16:49:49 -08:00
Igor Canadi
492c9f71c6 [CF] Column family support for LDB tool
Summary: Added list_column_family command and also updated dump_manifest

Test Plan: no

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16419
2014-02-27 16:39:23 -08:00
Lei Jin
ad0c3747cb cache SuperVersion in thread local storage to avoid mutex lock
Summary: as title

Test Plan:
asan_check
will post results later

Reviewers: haobo, igor, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16257
2014-02-27 11:38:55 -08:00
Igor Canadi
85b1b5e1b9 [CF] WaitForFlush() instead of sleeping
Summary: If we sleep for 300ms the test fails in valgrind because it takes more than 300ms to flush. This way we WaitForFlush() when we're expecting flush, but still sleep and check if the flush happens even though it's not supposed to.

Test Plan: notest

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16401
2014-02-27 10:31:05 -08:00
Igor Canadi
4c42201204 [CF] Test fixes and speedup 2014-02-26 17:34:39 -08:00
Igor Canadi
343c32be7b [CF] DifferentMergeOperators and DifferentCompactionStyles tests
Summary:
Two new column family tests:
* DifferentMergeOperators -- three column families, one without merge operator, one with add operator and one with append operator. verify that operations work as expected.
* DifferentCompactionStyles -- three column families, two with level compactions and one with universal compaction. trigger the compactions and verify they work as expected.

Test Plan: nope

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16377
2014-02-26 16:05:24 -08:00
Igor Canadi
3c81546422 [CF] Make LogDeletionTest less flakey
Summary: Retry GetSortedWalFiles() and also wait 20ms before counting number of log files. WaitForFlush() doesn't necessarily wait for logs to be deleted, since logs are deleted outside of the mutex.

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16371
2014-02-26 14:41:18 -08:00
Igor Canadi
6e7cae7711 [CF] More tests
Summary: New unit tests for column families

Test Plan: this is a test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16359
2014-02-26 14:16:23 -08:00
Igor Canadi
9bce2b2a84 [CF] Fix lint errors in CF code
Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16347
2014-02-26 10:10:00 -08:00
Igor Canadi
8b7ab9951c [CF] Handle failure in WriteBatch::Handler
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.

Test Plan: Added test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16323
2014-02-26 10:10:00 -08:00
Igor Canadi
8895526308 Merge branch 'master' into columnfamilies 2014-02-25 17:04:48 -08:00
Igor Canadi
5ad7ee03ea [CF] Log deletion in column families
Summary:
* Added unit test that verifies that obsolete files are deleted.
* Advance log number for empty column family when cutting log file.
* MinLogNumber() bug fix! (caught by the new unit test)

Test Plan: unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16311
2014-02-25 16:54:41 -08:00
Igor Canadi
dc277f0ab7 [CF] Adaptation of GetLiveFiles for CF
Summary: Even if user flushes the memtables before getting live files, we still can't guarantee that new data didn't come in (to already-flushed memtables). If we want backups to provide consistent view of the database, we still need to get WAL files.

Test Plan: backupable_db_test

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16299
2014-02-25 13:21:14 -08:00
Igor Canadi
5a91746277 log file is uint64_t 2014-02-25 12:57:43 -08:00
Igor Canadi
4209516359 Schedule flush when waiting on flush
Summary:
This will also help with avoiding the deadlock. If a flush failed and we're waiting for a memtable to be flushed, we should schedule a new flush and hope a new one succeedes.

If paranoid_checks = false, Wait() will still hang on ENOSPC, but at least it will automatically continue when the space frees up. Current behavior both hangs and deadlocks.

Also, I renamed some 'compaction' to 'flush'. 'compaction' was leveldb way of saying things.

Test Plan: make check

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16281
2014-02-25 12:04:14 -08:00
Lei Jin
dea894ef8d expose wal_dir in db_bench
Summary: as title

Test Plan: ran db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16269
2014-02-25 10:43:46 -08:00
Albert Strasheim
72aacf6b96 A few more C API functions. 2014-02-25 10:32:28 -08:00
Igor Canadi
b69e7d99d5 [CF] Better handling of memtable logs
Summary: DBImpl now keeps a list of alive_log_files_. On every FindObsoleteFiles, it deletes all alive log files that are smaller than versions_->MinLogNumber()

Test Plan:
make check passes
no specific unit tests yet, will add

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16293
2014-02-25 09:55:13 -08:00
Igor Canadi
d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00
Igor Canadi
6ed450a58c DeleteFile should schedule Flush or Compaction
Summary:
More info here: https://github.com/facebook/rocksdb/issues/89
If flush fails because of ENOSPC, we have a deadlock problem. This is a quick fix that will continue the normal operation when user deletes the file and frees up the space on the device.

We need to address the issue more broadly with bg_error_ cleanup.

Test Plan: make check

Reviewers: dhruba, haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16275
2014-02-24 16:00:13 -08:00
Igor Canadi
2bf1151a25 Fix C API 2014-02-24 15:15:34 -08:00
Igor Canadi
18a7cdfba0 Merge pull request #82 from tecbot/api-enhancements
Enhancements to the API
2014-02-24 14:20:13 -08:00
Thomas Adam
ce2b1f7b44 added a test case for custom merge operator 2014-02-23 17:58:38 +01:00
Thomas Adam
68248a2ac5 added a delete method for custom filter policy and merge operator to make it possible to override the cleanup behaviour of the return value 2014-02-23 17:58:11 +01:00
sdong
b2d29675c8 Add a test in prefix_test to verify correctness of results
Summary:
Add a test to verify HashLinkList and HashSkipList (mainly for the former one) returns the correct results when inserting the same bucket in the different orders.

Some other changes:
(1) add the test to test list
(2) fix compile error
(3) add header

Test Plan: ./prefix_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: igor, yhchiang, i.am.jin.lei, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16143
2014-02-19 17:00:34 -08:00
Kai Liu
2b205b35d8 Disable putting filter block to block cache
Summary: This bug caused server crash issues because the filter block is too big and kept purging out of cache.

Test Plan: Wrote a new unit tests to make sure it works.

Reviewers: dhruba, haobo, igor, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16221
2014-02-19 15:38:57 -08:00
Thomas Adam
d74c9b79ea Enhancements to the API 2014-02-19 23:59:54 +01:00
sdong
e90d3f7752 First Transaction Logs Should Not Skip Storage Options Given
Summary: Currently, the first transaction log file ignore bytes_per_sync and other storage-related options. It is not consistent. Fix it.

Test Plan: make all check. See the options set in GDB.

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: igor, ljin, yhchiang, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16215
2014-02-19 10:58:39 -08:00
Igor Canadi
6aef661230 some improvements to CompressedCache test 2014-02-14 17:47:53 -08:00
Igor Canadi
422bb09cb0 Fix table properties
Summary: Adapt table properties to column family world

Test Plan: make check

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16161
2014-02-14 17:13:10 -08:00
Igor Canadi
76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
Igor Canadi
be7e273d83 fix u/s comparison #83 2014-02-14 16:18:55 -08:00
Igor Canadi
c67d48c852 [CF] DB test to run on non-default column family
Summary:
This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far.

One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families.

Test Plan: This is a test!

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16149
2014-02-14 16:08:59 -08:00
kailiu
63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Kai Liu
b2e7ee8b41 Followup code refactor on plain table
Summary:
Fixed most comments in https://reviews.facebook.net/D15429.
Still have some remaining comments left.

Test Plan: make all check

Reviewers: sdong, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15885
2014-02-13 15:27:59 -08:00
Siying Dong
f3ae3d07cc Add more black-box tests for PlainTable and explicitly support total order mode
Summary:
1. Add some more implementation-aware tests for PlainTable
2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0  means binary search only. Also fixes some divide 0 risks.
3. Explicitly support total order (only use binary search)
4. some code cleaning up.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16023
2014-02-12 17:37:22 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
Igor Canadi
ca5f1a225a CompactionContext to include is_manual_compaction
Summary: Added a bit more information to compaction context, requested by internal team at FB.

Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16095
2014-02-12 12:24:18 -08:00
Lei Jin
994c327b86 IOError cleanup
Summary: Clean up IOErrors so that it only indicates errors talking to device.

Test Plan: make all check

Reviewers: igor, haobo, dhruba, emayanke

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15831
2014-02-12 11:42:54 -08:00
Lei Jin
5fbf2ef42d preload table handle on Recover() when max_open_files == -1
Summary: This covers existing table files before DB open happens and avoids contention on table cache

Test Plan: db_test

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16089
2014-02-12 10:43:27 -08:00
Lei Jin
28b7f7faa8 enable plain table in db_bench
Summary: as title

Test Plan: ran db_bench to gather stats

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16059
2014-02-12 10:41:55 -08:00
kailiu
265150cb49 Fix problem 3 for issue #80 2014-02-11 17:52:18 -08:00
Siying Dong
33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Kai Liu
d4b789fdee Add LIBRARY back to make dbg 2014-02-10 20:15:09 -08:00
Igor Canadi
8e634d3ea4 Merge pull request #74 from alberts/lz4
Support for LZ4 compression.
2014-02-10 15:46:56 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
kailiu
161ab42a8a Make table properties shareable
Summary:
We are going to expose properties of all tables to end users through "some" db interface.
However, current design doesn't naturally fit for this need, which is because:

1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version).
2. Copy table properties is OK, but it's slow.

Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce.

Test Plan: `make check` passed

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15999
2014-02-07 19:26:49 -08:00
Igor Canadi
8d4db63a2d [CF] OpenWithColumnFamilies -> Open
Summary: By discussion with @dhruba, overloading Open makes more sense

Test Plan: compiles!

Reviewers: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16017
2014-02-07 14:49:33 -08:00
Igor Canadi
99e61fdd5c [CF] Separate dumping of DBOptions and ColumnFamilyOptions
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.

Test Plan: make check, confirm contents of the LOG

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16011
2014-02-07 13:52:51 -08:00
Igor Canadi
2b8c44639a Merge branch 'master' into columnfamilies 2014-02-07 11:42:56 -08:00
Igor Canadi
1560bb913e Readrandom with tailing iterator
Summary:
Added an option for readrandom benchmark to run with tailing iterator instead of Get. Benefit of tailing iterator is that it doesn't require locking DB mutex on access.

I also have some results when running on my machine. The results highly depend on number of cache shards. With our current benchmark setting of 4 table cache shards and 6 block cache shards, I don't see much improvements of using tailing iterator. In that case, we're probably seeing cache mutex contention.

Here are the results for different number of shards

    cache shards       tailing iterator        get
       6                      1.38M           1.16M
      10                      1.58M           1.15M

As soon as we get rid of cache mutex contention, we're seeing big improvements in using tailing iterator vs. ordinary get.

Test Plan: ran regression test

Reviewers: dhruba, haobo, ljin, kailiu, sding

Reviewed By: haobo

CC: tnovak

Differential Revision: https://reviews.facebook.net/D15867
2014-02-07 09:47:47 -08:00
Igor Canadi
0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi
0b4ccf765c Flushes should always go to HIGH priority thread pool
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).

It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15987
2014-02-06 14:17:13 -08:00
Igor Canadi
f8d5443efe [CF] Thread-safety guarantees for ColumnFamilySet
Summary: Revised thread-safety guarantees and implemented a way to spinlock the object.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15975
2014-02-06 11:57:36 -08:00
Igor Canadi
8fa8a708ef [CF] Propagate correct options to WriteBatch::InsertInto
Summary:
WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family.

This required a bit more acrobatics since lots of interfaces had to be changed.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15957
2014-02-06 10:23:31 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Igor Canadi
b4f441f48a Fixed a bug introduced by previous commit 2014-02-05 14:58:24 -08:00
Igor Canadi
f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi
6e56ab5702 [CF] Add full_options_ to ColumnFamilyData
Summary:
Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack.

The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family)

Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15927
2014-02-05 12:26:40 -08:00
Igor Canadi
328ac7ee02 Merge branch 'master' into columnfamilies 2014-02-05 12:00:33 -08:00
Igor Canadi
c24d8c4e90 [CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.

To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.

This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 11:55:30 -08:00
Igor Canadi
7b9f134959 [CF] Move InternalStats to ColumnFamilyData
Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15879
2014-02-04 17:59:50 -08:00
Igor Canadi
73f62255c1 [CF] Split SanitizeOptions into two
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)

I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.

Test Plan: make check doesn't complain

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15873
2014-02-04 17:26:51 -08:00
kailiu
d43ebd8c65 Put table factory back to public api
Summary:
Previous I am too ambitious to hide every detail about table factory
to internal api. However, we cannot pass the compilatoin for external
users since we use table factory as the shared_ptr, which requires
the definition of table factory's destructor.

Test Plan: make check;

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15861
2014-02-03 19:51:20 -08:00
Igor Canadi
5e2c4fe766 Get rid of DBImpl::user_comparator()
Summary: user_comparator() is a Column Family property, not DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15855
2014-02-03 17:50:47 -08:00
Igor Canadi
0e22badc08 [column families] Iterator and MultiGet
Summary: Support for different column families in Iterator and MultiGet code path.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15849
2014-02-03 17:44:40 -08:00
Igor Canadi
2966d764cd Fix some 32-bit compile errors
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697

Test Plan: RocksDB still compiles on 64-bit :)

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15825
2014-02-03 13:48:30 -08:00
Igor Canadi
2a9271b403 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
2014-02-03 13:47:54 -08:00
Lei Jin
5b3b6549d6 use super_version in NewIterator() and MultiGet() function
Summary:
Use super_version insider NewIterator to avoid Ref() each component
separately under mutex
The new added bench shows NewIterator QPS increases from 515K to 719K
No meaningful improvement for multiget I guess due to its relatively small
cost comparing to 90 keys fetch in the test.

Test Plan: unit test and db_bench

Reviewers: igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D15609
2014-02-03 13:13:36 -08:00
Igor Canadi
29bacb2eb6 VersionSet cleanup
Summary:
Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance)
Unfriended VersionSet and ColumnFamilyData (yay!)
Removed VersionSet::NumberLevels()
Cleaned up DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15819
2014-02-03 13:10:47 -08:00
Siying Dong
d169b67680 [Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte.
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489
2014-02-03 12:19:30 -08:00
Igor Canadi
5c6ef56152 Fix printf format 2014-02-03 10:25:37 -08:00
kailiu
4f6cb17bdb First phase API clean up
Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525
2014-02-03 00:30:43 -08:00
Igor Canadi
27a8856c23 Compacting column families
Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange()

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15813
2014-01-31 19:54:03 -08:00
Igor Canadi
5661ed8b80 Fix reduce_levels_test 2014-01-31 19:44:48 -08:00
Igor Canadi
fe9bd300d6 Merge branch 'master' into columnfamilies 2014-01-31 17:17:22 -08:00
Dhruba Borthakur
30a700657d Fix corruption_test failure caused by auto-enablement of checksum verification.
Summary:
Patch  https://reviews.facebook.net/D15591 enabled checksum
verification by default. This caused the unit test to fail.

Test Plan: ./corruption_test

Reviewers: igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15795
2014-01-31 17:16:38 -08:00
Igor Canadi
f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
Igor Canadi
5693db2a02 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-01-31 14:45:15 -08:00
Igor Canadi
dbbffbd772 Mark the log_number file number used
Summary:
VersionSet::next_file_number_ is always assumed to be strictly greater than VersionSet::log_number_. In our new recovery code, we artificially set log_number_  to be (log_number + 1), so that once we flush, we don't recover from the same log file again (this is important because of merge operator non-idempotence)

When we set VersionSet::log_number_ to (log_number + 1), we also have to mark that file number used, such that next_file_number_ is increased to a legal level. Otherwise, VersionSet might assert.

This has not be a problem so far because here's what happens:
1. assume next_file_number is 5, we're recovering log_number 10
2. in DBImpl::Recover() we call MarkFileNumberUsed with 10. This will set VersionSet::next_file_number_ to 11.
3. If there are some updates, we will call WriteTable0ForRecovery(), which will use file number 11 as a new table file and advance VersionSet::next_file_number_ to 12.
4. When we LogAndApply() with log_number 11, assertion is true: assert(11 <= 12);

However, this was a lucky occurrence. Even though this diff doesn't cause a bug, I think the issue is important to fix.

Test Plan: In column families I have different recovery logic and this code path asserted. When adding MarkFileNumberUsed(log_number + 1) assert is gone.

Reviewers: dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15783
2014-01-31 14:43:16 -08:00
Igor Canadi
3615f534d1 Enable flushing memtables from arbitrary column families
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!

Test Plan: Added a new unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15789
2014-01-31 14:42:52 -08:00
Siying Dong
56bea9f80d When using Universal Compaction, Zero out seqID in the last file too
Summary: I didn't figure out the reason why the feature of zeroing out earlier sequence ID is disabled in universal compaction. I do see bottommost_level is set correctly. It should simply work if we remove the constraint of universal compaction.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15423
2014-01-31 09:59:52 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Dhruba Borthakur
abd70ecc2b The default settings enable checksum verification on every read.
Summary: The default settings enable checksum verification on every read.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15591
2014-01-30 19:14:03 -08:00
Igor Canadi
9ca638a86d Enable iterating column families with a concurrent writer
Summary:
Sometimes we iterate through column families, and unlock the mutex in the body of the iteration. While mutex is unlocked, some column family might be created or dropped. We need to be able to continue iterating through column families even though our current column family got dropped.

This diff implements circular linked lists that connect all column families. It then uses the link list to enable iterating through linked lists. Even if the column family is dropped, its next_ pointer still can be used to advance to another alive column family.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15603
2014-01-30 16:57:52 -08:00
Igor Canadi
6973bb1722 MakeRoomForWrite() support for column families
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.

Test Plan: make check does not complain

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15597
2014-01-30 16:12:08 -08:00
Igor Canadi
c37e7de669 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-01-30 11:47:23 -08:00
Igor Canadi
ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
Igor Canadi
3c0dcf0e25 InternalStatistics
Summary:
In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste.

Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15273
2014-01-29 20:40:41 -08:00
Lei Jin
d118707f8d set bg_error_ when background flush goes wrong
Summary: as title

Test Plan: unit test

Reviewers: haobo, igor, sdong, kailiu, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15435
2014-01-29 15:55:58 -08:00
Igor Canadi
514e42c7cc Fix some lint warnings 2014-01-29 15:27:27 -08:00
Igor Canadi
fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Igor Canadi
15999e728e Fix column family test (create directory) 2014-01-29 14:06:59 -08:00
Igor Canadi
4662969bf5 PurgeObsoleteFiles in DropColumnFamily
Summary: When we drop the column family, we want to delete all the files from that column family.

Test Plan: make check

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15561
2014-01-29 11:58:33 -08:00
Igor Canadi
20b231d712 Merge branch 'master' into columnfamilies 2014-01-29 11:47:43 -08:00
Igor Canadi
f24a3ee52d Read from and write to different column families
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()

Test Plan: Added a unit test in column_family_test

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15537
2014-01-29 11:38:16 -08:00
Igor Canadi
e57f0cc1a1 add include <atomic> to version_set.h 2014-01-29 08:17:43 -08:00
Igor Canadi
c1071ed95c Merge branch 'master' into columnfamilies 2014-01-28 16:04:00 -08:00
Igor Canadi
5d2c62822e Only get the manifest file size if there is no error
Summary:
I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry.

@kailiu, can we get this in release, too?

Test Plan: make check

Reviewers: kailiu, dhruba, haobo

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D15513
2014-01-28 16:02:51 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Mark Callaghan
90f29ccbef Update monitoring to include average time per compaction and stall
Summary:
The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds.
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count   msComp   msStall  Ln-stall Stall-cnt
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        8       15   1.5         2         0        30         0         0        30        0.0       0.0        15.5        0        0        0        0       16      112       0.2       1.3      7568
  1        8       16   1.6         1        26        26        15        11        16        3.5      17.6        18.1        8        6       13        7        3      362       0.0       0.0         0
  2        1        2   0.0         0         0         2         0         0         2        0.0       0.0        18.4        0        0        0        0        1       50       0.0       0.0         0

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15345
2014-01-27 15:06:04 -08:00
Schalk-Willem Kruger
3d33da75ef Fix UnmarkEOF for partial blocks
Summary:
Blocks in the transaction log are a fixed size, but the last block in the transaction log file is usually a partial block. When a new record is added after the reader hit the end of the file, a new physical record will be appended to the last block. ReadPhysicalRecord can only read full blocks and assumes that the file position indicator is aligned to the start of a block. If the reader is forced to read further by simply clearing the EOF flag, ReadPhysicalRecord will read a full block starting from somewhere in the middle of a real block, causing it to lose alignment and to have a partial physical record at the end of the read buffer. This will result in length mismatches and checksum failures. When the log file is tailed for replication this will cause the log iterator to become invalid, necessitating the creation of a new iterator which will have to read the log file from scratch.

This diff fixes this issue by reading the remaining portion of the last block we read from. This is done when the reader is forced to read further (UnmarkEOF is called).

Test Plan:
- Added unit tests
- Stress test (with replication). Check dbdir/LOG file for corruptions.
- Test on test tier

Reviewers: emayanke, haobo, dhruba

Reviewed By: haobo

CC: vamsi, sheki, dhruba, kailiu, igor

Differential Revision: https://reviews.facebook.net/D15249
2014-01-27 14:49:10 -08:00
Igor Canadi
511b03a5b5 LogAndApply to take ColumnFamilyData
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.

Test Plan: make check

Reviewers: dhruba, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15465
2014-01-27 13:57:58 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Siying Dong
b20486f294 [Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15393
2014-01-27 10:26:14 -08:00
Igor Canadi
cf783c674a Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.h
2014-01-27 10:00:24 -08:00
Igor Canadi
6c2ca1d3e6 Move NeedsCompaction() from VersionSet to Version
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.

Test Plan: make check

Reviewers: kailiu, haobo, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15333
2014-01-27 09:59:00 -08:00
Igor Canadi
e55b3c040c Fixing ref-counting memtables 2014-01-26 18:03:38 -08:00
Mike Lin
af7838de36 address code review comments on 5e3aeb5f8e
- reduce string copying in Compaction::Summary
- simplify file number checking in UniversalCompactionStopStyleSimilarSize unit test
2014-01-25 14:12:24 -08:00
Igor Canadi
983fafa56c Fix memory leak 2014-01-25 13:57:11 -08:00
Igor Canadi
68a91a2e6a missing include 2014-01-24 18:40:05 -08:00
Igor Canadi
5356b2a680 Merge branch 'master' into columnfamilies 2014-01-24 18:34:48 -08:00
Igor Canadi
04afa32134 Fix reduce levels
ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we
didn't change the number of levels in VersionSet (we consider them
immutable from now on). This fixes the problem.
2014-01-24 18:32:38 -08:00
Siying Dong
8477255da3 Moving Some includes from options.h to forward declaration
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.

Test Plan: make all check

Reviewers: kailiu, haobo, igor, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15411
2014-01-24 17:16:22 -08:00
Igor Canadi
6a404de4ba Merge branch 'master' into columnfamilies 2014-01-24 15:54:18 -08:00
Igor Canadi
f653fdcf5a Fixing iterator cleanup for Tailing iterator
Immutable tailing iterator doesn't set CleanupState::mem, so we don't
have to unref it.
2014-01-24 15:51:06 -08:00
Igor Canadi
1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi
677fee27c6 Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.

Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.

Test Plan: reduce_levels_test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15303
2014-01-24 14:57:04 -08:00
Igor Canadi
c583157d49 MemTableListVersion
Summary:
MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful.

This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team.

I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet())

Test Plan: `make check` works. I will also do `make valgrind_check` before commit

Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15255
2014-01-24 14:52:08 -08:00
Igor Canadi
e832e72b31 Revert "Moving to glibc-fb"
This reverts commit d24961b65e.

For some reason, glibc2.17-fb breaks gflags. Reverting for now
2014-01-24 11:50:38 -08:00
kailiu
66dc033af3 Temporarily disable caching index/filter blocks
Summary:
Mixing index/filter blocks with data blocks resulted in some known
issues.  To make sure in next release our users won't be affected,
we added a new option in BlockBasedTableFactory::TableOption to
conceal this functionality for now.

This patch also introduced a BlockBasedTableReader::OpenOptions,
which avoids the "infinite" growth of parameters in
BlockBasedTableReader::Open().

Test Plan: make check

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: igor

CC: leveldb, tnovak

Differential Revision: https://reviews.facebook.net/D15327
2014-01-24 10:57:15 -08:00
Igor Canadi
d24961b65e Moving to glibc-fb
Summary:
It looks like we might have some trouble when building the new release with 4.8, since fbcode is using glibc2.17-fb by default and we are using glibc2.17. It was reported by Benjamin Renard in our internal group.

This diff moves our fbcode build to use glibc2.17-fb by default. I got some linker errors when compiling, complaining that `google::SetUsageMessage()` was undefined. After deleting all offending lines, the compile was successful and everything works.

Test Plan:
Compiled
Ran ./db_bench ./db_stress ./db_repl_stress

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15405
2014-01-24 10:24:08 -08:00
Siying Dong
4605e20c58 If User setting of compaction multipliers overflow, use default value 1 instead
Summary: Currently, compaction multipliers can overflow and cause unexpected behaviors. In this patch, we detect those overflows and use multiplier 1 for them.

Test Plan: make all check

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15321
2014-01-24 10:14:23 -08:00
Igor Canadi
28d1a0c6f5 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.h
	db/db_test.cc
	include/rocksdb/db.h
	include/utilities/stackable_db.h
2014-01-24 09:27:29 -08:00
Igor Canadi
09489d395f Fix a bug in DBImpl::CreateColumnFamily 2014-01-24 09:12:00 -08:00
Lei Jin
aba2acb5ec CompactRange() to return status
Summary: as title

Test Plan:
make all check
What else tests shall I cover?

Reviewers: igor, haobo

CC:

Differential Revision: https://reviews.facebook.net/D15339
2014-01-23 16:41:46 -08:00
Kai Liu
054c5dda8c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
	util/statistics_imp.h
2014-01-23 16:32:49 -08:00
Tomislav Novak
81c9cc9b3b Tailing iterator
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.

TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).

Test Plan:
* new unit tests
* running LD with this patch

Reviewers: igor, dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb, lovro, march

Differential Revision: https://reviews.facebook.net/D15285
2014-01-23 16:26:08 -08:00
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
f9a25dda9f Fix wrong merge 2014-01-22 11:30:19 -08:00
Igor Canadi
92a022ad07 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
	db/version_set.cc
2014-01-22 10:59:07 -08:00
Igor Canadi
fb01755aa4 Unfriending classes
Summary:
In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends.

I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :)

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15267
2014-01-22 10:55:16 -08:00
Igor Canadi
6fe9b57748 Refactor Recover() code
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered

The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.

[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15237
2014-01-22 10:45:26 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Siying Dong
7dea558e6d [Performance Branch] Fix a bug when merging from master
Summary: Commit "1304d8c8cefe66be1a3caa5e93413211ba2486f2" (Merge branch 'master' into performance) removes a line in performance branch by mistake. This patch fixes it.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15297
2014-01-21 12:44:43 -08:00
Mark Callaghan
4e8321bfea Boost access before mutex is unlocked
Summary:
This moves the use of versions_ to before the mutex is unlocked
to avoid a possible race.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15279
2014-01-17 21:32:23 -08:00
Kai Liu
ef602f6275 Misc cleanup on performance branch
Summary:

Did some trivial stuffs:

* Add more comments;
* fix compiler's warning messages (uninitialized variables).
* etc

Test Plan:

make check
2014-01-17 14:26:29 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
Igor Canadi
8079dd5d24 Merge branch 'master' into performance 2014-01-17 12:20:07 -08:00
Igor Canadi
0f4a75b710 Fix SIGSEGV in compaction picker
Summary:
The SIGSEGV was introduced by https://reviews.facebook.net/D15171

I also fixed ExpandWhileOverlapping() which returned the failure by setting it's own stack variable to nullptr (!). This bug is present in 2.6 release, so I guess ExpandWhileOverlapping never fails :)

Test Plan: `make check`. Also MarkCallaghan confirmed it fixed the SIGSEGV he reported.

Reviewers: MarkCallaghan, kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15261
2014-01-17 12:02:03 -08:00
Mark Callaghan
439e36db21 Fix SlowdownAmount
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall

With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.

With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown

Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15243
2014-01-17 10:15:30 -08:00
Mike Lin
5e3aeb5f8e An initial implementation of kCompactionStopStyleSimilarSize for universal compaction 2014-01-16 22:59:34 -08:00
Mike Lin
b1194f4903 Minor compaction logging improvements
1) make summary less likely to be truncated
2) format human-readable file sizes in summary
3) log the motivation for each universal compaction
2014-01-16 22:54:31 -08:00
Naman Gupta
1447bb5919 Allow callback to change size of existing value. Change return type of the callback function to an enum status to handle 3 cases.
Summary:
This diff fixes 2 hacks:
* The callback function can modify the existing value inplace, if the merged value fits within the existing buffer size. But currently the existing buffer size is not being modified. Now the callback recieves a int* allowing the size to be modified. Since size is encoded as a varint in the internal key for memtable. It might happen that the entire value might have be copied to the new location if the new size varint is smaller than the existing size varint.
* The callback function has 3 functionalities
    1. Modify existing buffer inplace, and update size correspondingly. Now to indicate that, Returns 1.
    2. Generate a new buffer indicating merged value. Returns 2.
    3. Fails to do either of above, based on whatever application logic. Returns 0.

Test Plan: Just make all for now. I'm adding another unit test to test each scenario.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb, sdong, kailiu, xinyaohu, sumeet, danguo

Differential Revision: https://reviews.facebook.net/D15195
2014-01-16 15:12:39 -08:00
Kai Liu
d4f65f1683 Merge branch 'master' into performance
This patch merges master's changes on build_tools/format-diff.sh.
Conflicts:
	db/version_edit.cc
2014-01-16 14:31:18 -08:00
Igor Canadi
6d6fb70960 Remove compaction pointers
Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15225
2014-01-16 14:06:53 -08:00
Igor Canadi
c699c84af4 CompactionPicker
Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.

To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.

In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.

This diff depends on D15171, D15183, D15189 and D15201

Test Plan: `make check`

Reviewers: kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15207
2014-01-16 13:03:52 -08:00
kailiu
1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
kailiu
eae1804f29 Remove the unnecessary use of shared_ptr
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213
2014-01-15 18:22:01 -08:00
Igor Canadi
787f11bb3b Move more functions from VersionSet to Version
Summary:
This moves functions:
* VersionSet::Finalize() -> Version::UpdateCompactionStats()
* VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()

The diff depends on D15189, D15183 and D15171

Test Plan: make check

Reviewers: kailiu, sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15201
2014-01-15 16:23:36 -08:00
Igor Canadi
615d1ea2f4 Moving Compaction class to separate header file
Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.

The diff depends on D15171 and D15183

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15189
2014-01-15 16:22:34 -08:00
Igor Canadi
2f4eda7890 Move functions from VersionSet to Version
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183
2014-01-15 16:18:04 -08:00
Igor Canadi
65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
Siying Dong
9b51af5a17 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589

Conflicts:
	db/db_impl.cc
2014-01-14 17:41:44 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Igor Canadi
1ed2404f27 Wrong number of levels is Invalid argument now, not corruption 2014-01-14 15:54:11 -08:00
Igor Canadi
6291020284 Fix test 2014-01-14 15:41:30 -08:00
Igor Canadi
7f3e417f59 Fix memtable construction in tests 2014-01-14 15:36:12 -08:00
Igor Canadi
055e6df45b VersionEdit not to take NumLevels()
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.

Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.

This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15159
2014-01-14 15:27:09 -08:00
Igor Canadi
7d9f21cf23 BuildBatchGroup -- memcpy outside of lock
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.

Test Plan:
`make check`

I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D15063
2014-01-14 14:49:31 -08:00
Naman Gupta
1d9bac4d7f Use sanitized options while opening db
Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture.

Test Plan: make check

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14103
2014-01-14 11:46:24 -08:00
Siying Dong
9ea8bf90f1 DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 11:24:43 -08:00
Siying Dong
fbbf0d1456 Pre-calculate whether to slow down for too many level 0 files
Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141
2014-01-14 11:23:02 -08:00
Siying Dong
51dd21926c DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 10:53:16 -08:00
Naman Gupta
8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
Igor Canadi
a107691711 [column families] Implement refcounting ColumnFamilyData
Summary: We don't want to delete ColumnFamilyData object if somebody has references to it.

Test Plan: `make check` for now, but will need to implement bigger column family test case

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15111
2014-01-13 09:22:44 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Igor Canadi
d076cef347 [column families] Get rid of VersionSet::current_ and keep current Version for each column family
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.

I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15105
2014-01-13 08:59:15 -08:00
Igor Canadi
dd6ecdf342 Use ASSERT_EQ() instead of assert() in merge_test 2014-01-11 09:25:47 -08:00
Schalk-Willem Kruger
a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Siying Dong
aa0ef6602d [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039
2014-01-10 15:57:49 -08:00
Siying Dong
5b5ab0c1a8 [Performance Branch] Fix memory leak in HashLinkListRep.GetIterator()
Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it.

Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15093
2014-01-10 12:12:28 -08:00
Siying Dong
237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
Siying Dong
424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00
Siying Dong
5575316350 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703
2014-01-08 16:05:36 -08:00
Igor Canadi
19e3ee64ac Add column family information to WAL
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).

This endeavour required changing some WriteBatch internals.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15045
2014-01-08 12:53:33 -08:00
Mark Callaghan
50994bf699 Don't always compress L0 files written by memtable flush
Summary:
Code was always compressing L0 files written by a memtable flush
when compression was enabled. Now this is done when
min_level_to_compress=0 for leveled compaction and when
universal_compaction_size_percent=-1 for universal compaction.

Task ID: #3416472

Blame Rev:

Test Plan:
ran db_bench with compression options

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba, igor, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14757
2014-01-07 21:50:26 -08:00
Igor Canadi
a45b7d83ba Merge pull request #59 from mlin/more-c-bindings
C API: add rocksdb_env_set_high_priority_background_threads
2014-01-07 16:33:03 -08:00
Igor Canadi
72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi
d3a2ba9c64 Merge branch 'master' into columnfamilies 2014-01-07 11:05:03 -08:00
Igor Canadi
17a222670b Merge branch 'master' into performance 2014-01-07 11:04:21 -08:00
Tomislav Novak
9f690ec62c Fix a deadlock in CompactRange()
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):

   1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
      bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
      (bg_compaction_scheduled_ is now 10) and waits for it to complete.
   2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
      (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
      drop to LargeNumber.
   3. BG thread completes the first manual compaction, decrements
      bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
      Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
      (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
      thread #2 also goes to sleep (waiting on bg_cv_), without resetting
      bg_compaction_scheduled_.

This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).

Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.

Tests (make check) pass.

Reviewers: dhruba, igor, sdong, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14799
2014-01-07 10:37:34 -08:00
Igor Canadi
fff5c7e817 Merge branch 'master' into columnfamilies 2014-01-06 13:31:41 -08:00
Kai Liu
5e7d5629c7 Fix the valgrind issues 2014-01-03 11:48:31 -08:00
Igor Canadi
ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
Kai Liu
774ed89c24 Replace vector with autovector
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.

Test Plan:
make check

I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.

Reviewers: dhruba, haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14985
2014-01-02 16:43:35 -08:00
kailiu
e72aa37cc5 Merge branch 'master' into performance
Conflicts:
	db/table_cache.cc
2014-01-02 16:34:59 -08:00
kailiu
476416c27c Some minor refactoring on the code
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.

Test Plan: ran `make check`

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15009
2014-01-02 16:32:31 -08:00
kailiu
9281a826f1 Hotfix the bug in table cache's GetSliceForFileNumber
Forgot to fix this problem in master branch. Already fixed it in performance branch.
2014-01-02 10:30:42 -08:00
Igor Canadi
7535443083 [RocksDB] Support for column families in manifest
Summary:
<This diff is for Column Family branch>

Added fields in manifest file to support adding and deleting column families.

Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14733
2014-01-02 04:18:28 -08:00
Igor Canadi
6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
Igor Canadi
b60c14f6ee Support multi-threaded DisableFileDeletions() and EnableFileDeletions()
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().

However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.

Test Plan: make check

Reviewers: dhruba, haobo, sanketh

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14781
2014-01-02 03:33:42 -08:00
Mike Lin
4b1d049236 C API: add rocksdb_env_set_high_priority_background_threads 2013-12-31 15:14:18 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Siying Dong
a094f3b3b5 TableCache.FindTable() to avoid the mem copy of file number
Summary: I'm not sure what's the purpose of encoding file number to a new buffer for looking up the table cache. It seems to be unnecessary to me. With this patch, we point the lookup key to the address of the int64 of the file number.

Test Plan: make all check

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14811
2013-12-26 16:57:07 -08:00
Siying Dong
18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
kailiu
079a21ba99 Fix the unused variable warning message in mac os 2013-12-26 15:12:30 -08:00
Haobo Xu
bf4a48ccb3 [RocksDB] [Performance Branch] Revert previous patch.
Summary: The previous patch is wrong. rep_.resize(kHeader) just resets the header portion to zero, and should not cause a re-allocation if g++ does it right. I will go ahead and revert it.

Test Plan: make check

Reviewers: dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14793
2013-12-20 18:20:06 -08:00
Haobo Xu
e94eea4527 [RocksDB] [Performance Branch] Minor fix, Remove string resize from WriteBatch::Clear
Summary: tmp_batch_ will get re-allocated for every merged write batch because of the existing resize in WriteBatch::Clear. Note that in DBImpl::BuildBatchGroup, we have a hard coded upper limit of batch size 1<<20 = 1MB already.

Test Plan: make check

Reviewers: dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14787
2013-12-20 16:29:05 -08:00
Siying Dong
abaf26266d [RocksDB] [Performance Branch] Some Changes to PlainTable format
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test

Test Plan: test db_test

Reviewers: haobo, kailiu

CC: dhruba, igor, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D14457
2013-12-20 12:08:35 -08:00
Igor Canadi
1fdb3f7dc6 [RocksDB] Optimize locking for Get
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679
2013-12-20 09:57:58 -08:00
Mark Callaghan
ca92068b12 Add 'readtocache' test
Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
  db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14739
2013-12-18 16:54:53 -08:00
Igor Canadi
269709a885 Merge branch 'master' into columnfamilies 2013-12-18 15:56:37 -08:00
Igor Canadi
3b50b6213d Merge pull request #37 from mlin/more-c-bindings
C bindings: add a bunch of the newer options
2013-12-18 13:12:04 -08:00
Igor Canadi
9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
Siying Dong
14995a8ff3 Move level0 sorting logic from Version::SaveTo() to Version::Finalize()
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.

Test Plan: make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14607
2013-12-17 18:06:58 -08:00
Siying Dong
a8b8b11dc4 Get() Does Not Reserve space for to_delete memtables
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14697
2013-12-17 17:16:16 -08:00
Mike Lin
2a2506b629 C bindings: add a bunch of the newer options 2013-12-15 13:47:06 -08:00
Kai Liu
2e9efcd6d8 Add the property block for the plain table
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format

  [data block]
  [meta block 1: stats block]
  [meta block 2: future extended block]
  ...
  [meta block K: future extended block]  (we may add more meta blocks in the future)
  [metaindex block]
  [index block: we only have the placeholder here, we can add persistent index block in the future]
  [Footer: contains magic number, handle to metaindex block and index block]
  <end_of_file>

Test Plan: extended existing property block test.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14523
2013-12-13 17:18:14 -08:00
kailiu
0cd1521af5 Completely remove argv_ since no one use it
There are still warning in some other environment, just move that useless variable `argv_`
2013-12-12 16:36:38 -08:00
kailiu
0e24f97b9f Revert last commit and add "unused" attribute to suppress warning 2013-12-12 15:40:44 -08:00
kailiu
bc9b488e92 fix a warning in db_test when running make release 2013-12-12 15:35:02 -08:00
Mark Callaghan
e9e6b00d29 Add monitoring for universal compaction and add counters for compaction IO
Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583
2013-12-12 13:27:43 -08:00
Siying Dong
e8ab1934d9 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589
2013-12-12 11:30:00 -08:00
Siying Dong
aaf9c6203c [RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14595
2013-12-11 13:59:43 -08:00
Siying Dong
a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
Siying Dong
bc5dd19b14 [RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409
2013-12-11 10:50:09 -08:00
Siying Dong
41349d9ef1 [RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409
2013-12-11 10:49:49 -08:00
Siying Dong
0304e3d2ff When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577

Conflicts:
	db/db_impl.cc
2013-12-11 10:02:17 -08:00
Siying Dong
95a411d853 When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577
2013-12-11 09:57:19 -08:00
Haobo Xu
3c02c363b3 [RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering
Summary: as title

Test Plan: dynamic_bloom_test

Reviewers: dhruba, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14385
2013-12-11 00:15:14 -08:00
kailiu
a82f42b765 rename db/memtablelist.{h,cc} 2013-12-10 19:03:13 -08:00
Igor Canadi
204bb9cffd Get rid of LogFlush() in InternalIterator 2013-12-10 10:59:00 -08:00
Igor Canadi
19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00
Igor Canadi
a204dabb9d Merge pull request #31 from sepeth/c-api
Rename leveldb to rocksdb in C api
2013-12-10 09:18:47 -08:00
Doğan Çeçen
6c4e110c8c Rename leveldb to rocksdb in C api 2013-12-10 10:48:35 +02:00
Igor Canadi
fb9fce4fc3 [RocksDB] BackupableDB
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

   backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
               0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
               files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
               files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediatelly when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295
2013-12-09 14:06:52 -08:00
kailiu
551e9428ce Merge branch 'master' into performance 2013-12-06 14:15:42 -08:00
Siying Dong
ef2211a9ca [RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415
2013-12-06 10:28:59 -08:00
kailiu
b1d2de4a40 Fix #26 by putting the implementation of CreateDBStatistics() to a cc file 2013-12-05 22:29:03 -08:00
kailiu
90729f8b23 Extract metaindex block from block-based table
Summary: This change will allow other table to reuse the code for meta blocks.

Test Plan: all existing unit tests passed

Reviewers: dhruba, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14475
2013-12-05 16:34:16 -08:00
Mayank Agarwal
92e8316118 Make GetDbIdentity pure virtual and also implement it for StackableDB, DBWithTTL
Summary: As title

Test Plan: make clean and make

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14469
2013-12-05 12:02:31 -08:00
Mayank Agarwal
18802689b8 Make an API to get database identity from the IDENTITY file
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)

Test Plan: db/db_test (has identity checks)

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14463
2013-12-04 22:39:17 -08:00
Mark Callaghan
97aa401e2f Add compression options to db_bench
Summary:
This adds 2 options for compression to db_bench:
* universal_compression_size_percent
* compression_level - to set zlib compression level
It also logs compression_size_percent at startup in LOG

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14439
2013-12-03 14:28:48 -08:00
Sajal Jain
28a1b9b95f [rocksdb] statistics counters for memtable hits and misses
Summary:
added counters
rocksdb.memtable.hit - for memtable hit
rocksdb.memtable.miss - for memtable miss

Test Plan: db_bench tests

Reviewers: igor, dhruba, haobo

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D14433
2013-12-03 12:59:53 -08:00
Igor Canadi
eb12e47e0e Killing Transform Rep
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.

This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.

I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14397
2013-12-03 12:42:15 -08:00
Igor Canadi
043fc14c3e Get rid of some shared_ptrs
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.

The ones that are left are accessed infrequently and I think we're fine with keeping them.

Test Plan: make asan_check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14427
2013-12-03 11:17:58 -08:00
Dhruba Borthakur
96bc3ec297 Memtables should be deleted appropriately in the unit test.
Summary:
Memtables should be deleted appropriately in the unit test.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-12-01 21:23:44 -08:00
Dhruba Borthakur
98968ba937 Free obsolete memtables outside the dbmutex had a memory leak.
Summary:
The commit at 27bbef1180 had a memory leak
that was detected by valgrind. The memtable that has a refcount decrement
in MemTableList::InstallMemtableFlushResults was not freed.

Test Plan: valgrind ./db_test --leak-check=full

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14391
2013-11-28 10:25:22 -08:00
Igor Canadi
35ddf18367 Don't do compression tests if we don't have compression libs
Summary: These tests fail if compression libraries are not installed.

Test Plan: Manually disabled snappy, observed tests not ran.

Reviewers: dhruba, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14379
2013-11-27 13:32:56 -08:00
Kai Liu
1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
Haobo Xu
4e6463ea44 [RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.

Test Plan: make check and test it on leaf4

Reviewers: dhruba, sdong, kailiu

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14307
2013-11-26 21:59:36 -08:00
Dhruba Borthakur
8478f380a0 During benchmarking, I see excessive use of vector.reserve().
Summary:
This code path can potentially accumulate multiple important_files for level 0.
But for other levels, it should have only one file in the
important_files, so it is ok not to reserve excessive space, is it not?

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14349
2013-11-26 07:47:08 -08:00
Dhruba Borthakur
27bbef1180 Free obsolete memtables outside the dbmutex.
Summary:
Large memory allocations and frees are costly and best done outside the
db-mutex. The memtables are already allocated outside the db-mutex but
they were being freed while holding the db-mutex.
This patch frees obsolete memtables outside the db-mutex.

Test Plan:
make check
db_stress

Unit tests pass, I am in the process of running stress tests.

Reviewers: haobo, igor, emayanke

Reviewed By: haobo

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14319
2013-11-25 21:04:48 -08:00
Igor Canadi
3ce3658411 DB::GetOptions()
Summary: We need access to options for BackupableDB

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14331
2013-11-25 15:51:50 -08:00
Igor Canadi
11c26bd4a4 [RocksDB] Interface changes required for BackupableDB
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review

Test Plan: make asan_check

Reviewers: dhruba, haobo, emayanke

Reviewed By: emayanke

CC: leveldb, kailiu, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14301
2013-11-25 12:39:23 -08:00
Dhruba Borthakur
299f5c76bb Create new log file outside the dbmutex.
Summary:
All filesystem Io should be done outside the dbmutex. There was one place
when we have to roll the transaction log that we were creating the new log file
while holding the dbmutex.

I rearranged this code so that the act of creating the new transaction log
file is done without holding the dbmutex.  I also allocate the new memtable
outside the dbmutex, this is important because creating the memtable
could be heavyweight.

Test Plan: make check and dbstress

Reviewers: haobo, igor

Reviewed By: haobo

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14283
2013-11-25 11:23:42 -08:00
Haobo Xu
5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
Siying Dong
718488abc5 Add BloomFilter to PlainTableIterator::Seek()
Summary:
This patch adds a simple bloom filter in PlainTableIterator::Seek()

Test Plan: N/A

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-11-21 22:26:39 -08:00
kailiu
0c93df912e Improve the readability of the TableProperties::ToString() 2013-11-21 17:54:23 -08:00
Siying Dong
3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong
b135d01e7b Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001

Conflicts:
	table/merger.cc
2013-11-21 17:39:19 -08:00
Siying Dong
3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
Siying Dong
58e1956d50 [Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables.
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.

In longer term, we might want to revisit the way we abstract mem tables implementations.

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14265
2013-11-20 20:49:23 -08:00
Siying Dong
b59d4d5a50 A Simple Plain Table
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.

Test Plan:Add unit test

Reviewers:haobo,dhruba,kailiu

CC:

Task ID: #

Blame Rev:
2013-11-20 18:44:22 -08:00
Siying Dong
071fb0d77b Inline a couple of functions and put one save lazily clearing
Summary:
Machine several functions inline.
Also, in DBIter.Seek() make value cleaning up lazily done.
These are for the use case that Seek() are called lots of times but few return values.

Test Plan: make all check

Differential Revision: https://reviews.facebook.net/D14217
2013-11-20 17:32:57 -08:00
Haobo Xu
37b459f0aa [RocksDB] Test diff on performance branch
Summary: trivia comment change

Test Plan: Go through the step ofs developing under the performance branch

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14259
2013-11-20 14:34:52 -08:00
Haobo Xu
a617227a36 [RocksDB] fix prefix_test
Summary: user comparator needs to work if either input is prefix only.

Test Plan: ./prefix_test --write_buffer_size=100000 --total_prefixes=10000 --items_per_prefix=10

Reviewers: dhruba, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14241
2013-11-20 09:16:23 -08:00
kailiu
6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
kailiu
1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing test. and the sample output of table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
Igor Canadi
fc61428288 Include <unistd.h> in db_test
Summary: This is the only compile issue in Ubuntu. It might be better to include <unistd.h> only in env_posix and add Truncate function to Env, but since we use truncate only in db_test, I don't think it makes much sense.

Test Plan: Rocksdb now compiles on Ubuntu!

Reviewers: dhruba, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14127
2013-11-17 21:58:16 -08:00
Igor Canadi
de9ce7d439 Upgrading compiler to gcc4.8.1
Summary:
Finally did it - the trick was in using --dynamic-linker option. This is first step to running ASAN.

All of our code seems to compile just fine on 4.8.1. However, I still left fbcode.471.sh in the 'build_tools/' just in case.

Test Plan: make clean; make

Reviewers: dhruba, haobo, kailiu, emayanke, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14109
2013-11-17 13:52:55 -08:00
Kai Liu
75df72f2a5 Change the logic in KeyMayExist()
Summary:
Previously in KeyMayExist(), if DB::Get() returns non-Status::OK(), we assumes key may not exist.
However, as if index block is not in block cache, Status::Incomplete() will return. Worse still, if
options::filter_delete is enabled, we may falsely ignore the "delete" operation:

  https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L217-L220

This diff fixes this bug and will let crash-test pass.

Test Plan:
Ran:

  ./db_stress --test_batches_snapshots=1 --ops_per_thread=1000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=1 --reopen=0 --readpercent=5 --prefixpercent=45 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/home/kailiu/local/newer --max_key=100000000 --disable_seek_compaction=0 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152
--target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=1

Previously we'll see crash happens very soon.

Reviewers: igor, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14115
2013-11-17 01:00:34 -08:00
kailiu
97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff invoves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Pascal Borreli
443e04e62d Fixed typos 2013-11-16 11:21:34 +00:00
Igor Canadi
21905dd4a8 Start DeleteFileTest with clean plate
Summary:
Remove all the files from the test dir before the test. The test failed when there were some old files still in the directory, since it checks the file counts.
This is what caused jenkins' test failures. It was running fine on my machine so it was hard to repro.

Test Plan:
1. create an extra 000001.log file in the test directory
2. run a ./deletefile_test - test failes
3. patch ./deletefile_test with this
4. test succeeds

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14097
2013-11-15 16:30:23 -08:00
Igor Canadi
29c931f70b Avoid populating live set if we don't need to
Summary: Also changed some comments

Test Plan: ./deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14091
2013-11-14 22:42:02 -08:00
Igor Canadi
a0ce3fd00a PurgeObsoleteFiles() unittest
Summary:
Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly.

Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions.

Test Plan: Created a unit test

Reviewers: dhruba, haobo, nkg-

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14079
2013-11-14 18:03:57 -08:00
Vamsi Ponnekanti
94dde686bb [Merge operand meant for key K is being applied on wrong key]
Summary:
We iterate until we find a different key than original key.
ikey is pointing to next key when we break out of loop.
After the loop we apply all merge operands meant for original key
on the next key!

Test Plan:
Need to give a build to Marcin to test out.

Revert Plan: OK

Task ID: #3181932

Reviewers: haobo, emayanke, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14073
2013-11-14 17:13:24 -08:00
Igor Canadi
fda8142f29 Delete log files in the correct dir
Summary: Log files are stored in wal_dir, not dbname_

Test Plan: deletfile_test

Reviewers: nkg-

Reviewed By: nkg-

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14067
2013-11-13 14:54:54 -08:00
Kai Liu
88ba331c1a Add the index/filter block cache
Summary: This diff leverage the existing block cache and extend it to cache index/filter block.

Test Plan:
Added new tests in db_test and table_test

The correctness is checked by:

1. make check
2. make valgrind_check

Performance is test by:

1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance).
2. db_stress.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, haobo, xjin

Differential Revision: https://reviews.facebook.net/D13167
2013-11-12 22:46:51 -08:00
Kai Liu
35460ccb53 Fix the string format issue
Summary:

mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable.

Test Plan:

make clean && make -j32
2013-11-12 21:05:39 -08:00
Igor Canadi
d88d8ecf80 Fix deleting files
Summary: One more fix! In some cases, our filenames start with "/". Apparently, env_ can't handle filenames with double //

Test Plan:
deletefile_test does not include this line in the LOG anymore:
2013/11/12-18:11:43.150149 7fe4a6fff700 RenameFile logfile #3 FAILED -- IO error: /tmp/rocksdbtest-3574/deletefile_test//000003.log: No such file or directory

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14055
2013-11-12 20:32:07 -08:00
kailiu
21587760b9 Fixing the warning messages captured under mac os # Consider using git commit -m 'One line title' && arc diff. # You will save time by running lint and unit in the background.
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..

Test Plan: ran make in mac os

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14049
2013-11-12 20:05:28 -08:00
Igor Canadi
9df2b217e9 Move fast and break things
Summary:
Broke the compile when I removed purge_log_after_memtable_flush.

sorrybus

Test Plan: make db_bench works now

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14037
2013-11-12 12:42:42 -08:00
Igor Canadi
9bc4a26f56 Small changes in Deleting obsolete files
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827

Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.

I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.

Test Plan: Ran deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14025
2013-11-12 11:53:26 -08:00
Igor Canadi
dad425562f Move the comment
Summary: Moving the comment per @haobo suggestion.

Test Plan: No

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14019
2013-11-12 10:07:55 -08:00
Igor Canadi
4abd219cfc Combine two FindObsoleteFiles()
Summary: We don't need to call FindObsoleteFiles() twice

Test Plan: deletefile_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14007
2013-11-11 21:41:32 -08:00
Igor Canadi
94e139f94d Fixing failed delete file test
Summary: FindObsoleteFiles() has to be called before PurgeObsoleteFiles() because FindObsoleteFiles() sets manifest_file_number, log_number and prev_log_number to valid values.

Test Plan: deletefile_test now works

Reviewers: dhruba, emayanke, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13995
2013-11-11 21:03:41 -08:00
Dhruba Borthakur
318a4919d2 Fix valgrind check by initialising DeletionState.
Summary:
The valgrind error was introduced by commit
1510339e52. Initialize DeletionState
in constructor.

Test Plan: valgrind --leak-check=yes ./deletefile_test

Reviewers: igor, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13983
2013-11-11 16:01:13 -08:00
lovro
8a46ecd357 WriteBatch::Put() overload that gathers key and value from arrays of slices
Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload.  It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value.  Since RocksDB makes a copy internally, it's easy to support gather writes.

Test Plan: write_batch_test, new test case

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13947
2013-11-08 16:34:32 -08:00
Igor Canadi
1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Igor Canadi
dd218bbc88 Forgot to change interface everywhere
Summary: Changed the name and interface for creating HashSkipListRep. Forgot to change it in db_test.

Test Plan: make db_test

Reviewers: haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D13965
2013-11-08 12:23:12 -08:00
Igor Canadi
8b3379dc0a Implementing DynamicIterator for TransformRepNoLock
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.

Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D13953
2013-11-08 00:31:09 -08:00
Kai Liu
fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Kai Liu
bba6595b1f Fix the valgrind error
Summary: I this bug from valgrind report and found a place that may potentially leak memory.

Test Plan: re-ran the valgrind and no error any more

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13959
2013-11-07 15:46:48 -08:00
Igor Canadi
444cf88a56 Flush the log outside of lock
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.

We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.

Test Plan: db_test

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13935
2013-11-07 11:31:56 -08:00
Haobo Xu
fd2044883a [RocksDB] Generalize prefix-aware iterator to be used for more than one Seek
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.

Test Plan: test it on Leaf

Reviewers: dhruba, kailiu, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13929
2013-11-06 20:45:49 -08:00
shamdor
c2be2cba04 WAL log retention policy based on archive size.
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.

Test Plan:
1. Unit tests pass.
2. Benchmark.

Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13869
2013-11-06 18:46:28 -08:00
Igor Canadi
be96f2498e TransformRep - use array instead of unordered_map
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.

Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.

I will run the new benchmark and post the results here.

Test Plan: db_test

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13893
2013-11-06 11:55:43 -08:00
Haobo Xu
fe4a449472 [RocksDB] prefixhash memtable test
Summary: as title, half baked test for prefixhash memtable. Also contains deadlock test option

Test Plan: run it

Reviewers: igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13887
2013-11-05 23:20:10 -08:00
Dhruba Borthakur
39190588e1 Fix failure in rocksdb unit test CompressedCache
Summary:
The problem was that there was only a single key-value in a block
and its compressibility was less than 88%. Rocksdb refuses to
compress a block unless its compresses to lesser than 88% of its
original size. If a block is not compressed, it does nto get inserted
into the compressed block cache.

Create the test data so that multiple records fit into the same
data block. This increases the compressibility of these data block.

Test Plan: ./db_test

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13905
2013-11-05 16:11:34 -08:00
Dhruba Borthakur
7845fd9db9 Fixed valgrind error in DBTest.CompressedCache
Summary:
Fixed valgrind error in DBTest.CompressedCache.
This fixes the valgrind error (thanks to Haobo). I am still trying to reproduce the test-failure case deterministically.

Test Plan: db_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13899
2013-11-05 11:12:39 -08:00
Mayank Agarwal
f837f5b1c9 Making the transaction log iterator more robust
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places

Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13689
2013-11-04 20:49:03 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Igor Canadi
beeb74be6f Move I/O outside of lock
Summary:
I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this.

It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time).

Test Plan: make check (currently running)

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13839
2013-11-01 12:32:27 -07:00
Haobo Xu
8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Naman Gupta
b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Naman Gupta
fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Siying Dong
068a819ac9 Fix valgrind_check problem of simple_table_db_test.cc
Summary: Two memory issues caused valgrind_check to fail on simple_table_db_test. Fix it

Test Plan: make -j32 valgrind_check

Reviewers: kailiu, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13773
2013-10-29 14:29:03 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Kai Liu
7e91b86f4d Fix a valgrind warning
Summary:
A latest valgrind test found a recently added unit test has memory leak,
which is because DB is not closed at the end of the test.

Test Plan: re-run the valgrind locally and make sure there's no memory leakage any more.

Reviewers: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13725
2013-10-28 14:34:27 -07:00
Igor Canadi
100fa8e013 If a Put fails, fail all other puts
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.

I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?

Test Plan: Corruption unit test.

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13671
2013-10-28 12:36:02 -07:00
Kai Liu
a1d38a41fd fix the error message in debug mode
Summary:

my fix patch introduced a new error in debug mode.

Test Plan:

`make` and `make release`
2013-10-27 23:11:13 -07:00
Kai Liu
39c14891b6 Fix the gcc warning for unused variable
Summary: Fix the unused variable warning for `first` when running `make release`

Test Plan:
make
make check

Reviewers: dhruba, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13695
2013-10-27 22:57:45 -07:00
Mayank Agarwal
56305221c4 Unify DeleteFile and DeleteWalFiles
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff

Test Plan: Expanded delete_file test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13647
2013-10-25 08:32:14 -07:00
Kai Liu
c17607a251 Fix the log number bug when updating MANIFEST file
Summary:
Crash may occur during the flushes of more than two mem tables.

As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.

This diff fix the problem by setting the correct "log number" in MANIFEST.

Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13659
2013-10-24 21:05:33 -07:00
Slobodan Predolac
e44976b199 Conversion of db_bench, db_stress and db_repl_stress to use gflags
Summary: Converted db_stress, db_repl_stress and db_bench to use gflags

Test Plan: I tested by printing out all the flags from old and new versions. Tried defaults, + various combinations with "interesting flags". Also, tested by running db_crashtest.py and db_crashtest2.py.

Reviewers: emayanke, dhruba, haobo, kailiu, sdong

Reviewed By: emayanke

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D13581
2013-10-24 07:43:14 -07:00
Haobo Xu
2fb361ad98 [RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log.
Summary: as title

Test Plan: make check; ./perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13629
2013-10-23 13:38:39 -07:00
Mayank Agarwal
e56ce03691 Hardcoding temp file name for Identity file to 000000.dbtmp just like it's done for CURRENT file
Summary: as per Dhruba's suggestion

Test Plan: make all check; Seen the Id getting generated properly in db_repl_stress

Reviewers: dhruba, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13635
2013-10-23 11:34:22 -07:00
Mayank Agarwal
9b50106f9a Dbid feature
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.

Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13587
2013-10-22 12:23:34 -07:00
Mayank Agarwal
ae8e0770b4 Disallow transaction log iterator to skip sequences
Summary:
This is expected to solve the "gaps in transaction log iterator" problem.
* After a lot of observations on the gaps on the sigmafio machines I found that it is due to a race between log reader and writer always.
* So when we drop the wormhole subscription and refresh the iterator, the gaps are not there.
* It is NOT due to some boundary or corner case left unattended in the iterator logic because I checked many instances of the gaps against their log files with ldb. The log files are NOT corrupted also.
* The solution is to not allow the iterator to read incompletely written sequences and detect gaps inside itself and invalidate it which will cause the application to refresh the iterator normally and seek to the required sequence properly.
* Thus, the iterator can at least guarantee that it will not give any gaps.

Test Plan:
* db_test based log iterator tests
* db_repl_stress
* testing on sigmafio setup to see gaps go away

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13593
2013-10-22 11:45:35 -07:00
Siying Dong
65428b0c0a Fix Bug: iterator.Prev() or iterator.SeekToLast() might return the first element instead of the correct one
Summary:
Recent patch https://reviews.facebook.net/D11865 introduced a regression bug:

DBIter::FindPrevUserEntry(), which is called by DBIter::Prev() (and also implicitly if calling iterator.SeekToLast())  might do issue a seek when having skipped too many entries. If the skipped entry just before the seek() is a delete, the saved key is erased so that it seeks to the front, so Prev() would return the first element.

This patch fixes the bug by not doing seek() in DBIter::FindNextUserEntry() if saved key has been erased.

Test Plan: Add a test DBTest.IterPrevMaxSkip which would fail without the patch and would pass with the change.

Reviewers: dhruba, xjin, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13557
2013-10-17 18:33:18 -07:00