Commit Graph

732 Commits

Author SHA1 Message Date
sdong
ea0198fe9a Create log::Writer out of DB Mutex
Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex.

Test Plan: make all check

Reviewers: haobo, ljin, igor

Reviewed By: haobo

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17487
2014-04-04 15:46:28 -07:00
Lei Jin
c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
sdong
99c756f0fe Flush Buffered Info Logs Before Doing Compaction (one line change)
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.

Test Plan: Should be safe as long as several related db test passes

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17493
2014-04-04 10:58:30 -07:00
Igor Canadi
040657aec9 Fix MacOS errors 2014-04-03 16:04:34 -07:00
Igor Canadi
f76e4027ca initialize candidate count 2014-04-03 11:45:44 -07:00
sdong
b9767d0e09 Move several more logging inside DB mutex to log buffer
Summary: Move several some common logging still in DB mutex to log buffer.

Test Plan: make all check

Reviewers: haobo, igor, ljin, nkg-

Reviewed By: nkg-

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17439
2014-04-03 10:47:18 -07:00
Thomas Adam
98422cba77 [C-API] implemented more options 2014-04-03 10:47:37 +02:00
Thomas Adam
3a30b5b0be [C-API] added "rocksdb_options_set_plain_table_factory" to make it possible to use plain table factory 2014-04-03 10:47:37 +02:00
Haobo Xu
48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
Igor Canadi
d1d19f5db3 Fix valgrind error in c_test 2014-04-02 17:24:30 -07:00
sdong
158845ba9a Move a info logging out of DB Mutex
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17427
2014-04-02 16:48:32 -07:00
sdong
4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
sdong
284c365b77 Fix valgrind error caused by FileMetaData as two level iterator's index block handle
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead

Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17409
2014-04-02 14:38:28 -07:00
sdong
e0a87c4cf1 DBIter to use static allocated char array for saved_key_ (if it is not too long)
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: dhruba, igor, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17289
2014-04-01 16:43:11 -07:00
sdong
d50619a559 PlainTableIterator::Seek() shouldn't check bloom filter in total order mode
Summary:
In total order mode, iterator's seek() shouldn't check total order.

Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.

This bug was reported by @igor.

Test Plan: test plain_table_db_test

Reviewers: ljin, haobo, igor

Reviewed By: igor

CC: yhchiang, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17391
2014-04-01 15:05:16 -07:00
Thomas Adam
38dc5ef45f [C-API] added the possiblity to create a HashSkipList or HashLinkedList to support prefix seeks 2014-04-01 12:44:27 +02:00
Igor Canadi
577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Igor Canadi
8a139a054c More valgrind issues!
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?

Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: dhruba

CC: leveldb, danguo

Differential Revision: https://reviews.facebook.net/D17337
2014-03-29 10:34:47 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
sdong
2d3468c293 MemTableIterator not to reference Memtable
Summary: In one of the perf, I shows 10%-25% CPU costs of MemTableIterator.Seek(), when doing dynamic prefix seek, are spent on checking whether we need to do bloom filter check or finding out the prefix extractor. Seems that  more level of pointer checking makes CPU cache miss more likely. This patch makes things slightly simpler by copying pointer of bloom of prefix extractor into the iterator.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17247
2014-03-28 16:46:25 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Yueh-Hsuan Chiang
10cebec79e Fix the bug in MergeUtil which causes mixing values of different keys.
Summary: Fix the bug in MergeUtil which causes mixing values of different keys.

Test Plan:
stringappend_test
make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17235
2014-03-27 16:15:25 -07:00
Haobo Xu
a92194e5b2 [RocksDB] Add db property "rocksdb.cur-size-active-mem-table"
Summary: as title

Test Plan: db_test

Reviewers: sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17217
2014-03-27 15:14:04 -07:00
Igor Canadi
1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
sdong
d556200264 Some small cleaning up to make some compiling environment happy
Summary: Compiler complains some errors when building using our internal build settings. Fix them.

Test Plan: rebuild

Reviewers: haobo, dhruba, igor, yhchiang, ljin

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17199
2014-03-26 18:11:41 -07:00
Igor Canadi
6a08bc042a Fix no return warning in FileComparator 2014-03-26 14:46:07 -07:00
Igor Canadi
1e9621d4e5 Sort files correctly in Builder::SaveTo
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).

In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.

I also added a verification that all files are sorted correctly in CheckConsistency()

NOTE: This diff depends on D17037

Test Plan: make check. Will also run db_stress

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17049
2014-03-26 13:30:14 -07:00
Igor Canadi
954679bb0f AssertHeld() should do things
Summary:
AssertHeld() was a no-op before. Now it does things.

Also, this change caught a bad bug in SuperVersion::Init(). The method is calling db->mutex.AssertHeld(), but db variable is not initialized yet! I also fixed that issue.

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17193
2014-03-26 11:24:52 -07:00
Igor Canadi
ad9a39c9b4 [RocksDB] Preallocate new MANIFEST files
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17163
2014-03-26 09:37:53 -07:00
sdong
6b2e7a2a01 When Options.max_num_files=-1, non level0 files also by pass table cache
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.

Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.

Reviewers: haobo, igor, dhruba, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17001
2014-03-25 18:40:52 -07:00
Yueh-Hsuan Chiang
b9ce156e38 Add assert to MergeOperator::PartialMergeMulti to check # of operands.
Summary:
Add assert(operands_list.size() >= 2) in MergeOperator::PartialMergeMulti
to ensure it's only be called when we have at least two merge operands.

Test Plan: run merge_test and stringappend_test.

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17169
2014-03-25 13:39:17 -07:00
Danny Guo
d9ca83df28 [rocksdb] make init prefix more robust
Summary:
Currently if client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff added a bool to indicate if the prefix
has been intialized. I also added a unit test to cover this case and
make sure the new code path is hit.

Test Plan: db_test

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17151
2014-03-25 11:59:40 -07:00
Yueh-Hsuan Chiang
34f9da1cef Fix the failure of stringappend_test caused by PartialMergeMulti.
Summary:
Fix a bug that PartialMergeMulti will try to merge the first operand
with an empty slice.

Test Plan: run stringappend_test and merge_test.

Reviewers: haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17157
2014-03-25 11:50:09 -07:00
Danny Guo
b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
c21ce14fa5 Fix double-free in corruption_test 2014-03-20 14:37:30 -07:00
Igor Canadi
e67241f0b9 Sanity check on Open
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Yiting Li
7981a43274 Consistency Check Function
Summary: Added a function/command to check the consistency of live files' meta data

Test Plan:
Manual test (size mismatch, file not exist).
Command test script.

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16935
2014-03-20 13:42:45 -07:00
Igor Canadi
8ea3cb621e If paranoid_checks -- Mark DB read-only on any IOError
Summary:
Whenever we get an IOError from GetImpl() or NewIterator(), we should immediatelly mark the DB read-only. The same check already exists in Write() and Compaction().

This should help with clients that are somehow missing a file.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17061
2014-03-20 13:10:02 -07:00
sdong
f681030c80 Fix DBTest.UniversalCompactionTrigger failure caused by D17067
Summary: D17067 breaks DBTest.UniversalCompactionTrigger because of wrong location of the checking. Fix it.

Test Plan: Run the test and make sure it passes.

Reviewers: igor, haobo

Reviewed By: igor

CC: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17079
2014-03-20 11:10:11 -07:00
sdong
752ec46cd5 Add a unit test to verify compaction filter context
Summary: Add unit tests to make sure CompactionFilterContext::is_manual_compaction_ and CompactionFilterContext::is_full_compaction_ are set correctly.

Test Plan: run the new tests.

Reviewers: haobo, igor, dhruba, yhchiang, ljin

Reviewed By: haobo

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17067
2014-03-19 18:10:48 -07:00
Igor Canadi
fcd5c5e828 ComputeCompactionScore in CompactionPicker
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.

The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console

The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...

"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.

I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.

Test Plan: make check for now, will run db_stress in jenkins

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17037
2014-03-19 16:52:26 -07:00
Igor Canadi
e493f2f54e Don't compact with zero input files
Summary:
We have an issue with internal service trying to run compaction with zero input files:
2014/02/07-02:26:58.386531 7f79117ec700 Compaction start summary: Base version 1420 Base level 3, seek compaction:0, inputs:[ϛ~^Qy^?],[]
2014/02/07-02:26:58.386539 7f79117ec700 Compacted 0@3 + 0@4 files => 0 bytes

There are two issues:
* inputsummary is printing out junk
* it's constantly retrying (since I guess madeProgress is true), so it prints out a lot of data in the LOG file (40GB in one day).

I read through the Level compaction picker and added some failure condition if input[0] is empty. I think PickCompaction() should not return compaction with zero input files with this change. I'm not confident enough to add an assertion though :)

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16005
2014-03-19 16:01:25 -07:00
Igor Canadi
22507aff6c Fix compile issue in Mac OS
Summary:
Compile issues are:
* Unused variable env_
* Unused fallocate_with_keep_size_

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17043
2014-03-19 15:40:12 -07:00
Lei Jin
6dc940d4c9 avoid shared_ptr assignment in Version::Get()
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!

Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s

Reviewers: haobo, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17007
2014-03-19 10:54:32 -07:00
sdong
71e6a34271 Add a DB property to indicate number of background errors encountered
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring

Test Plan: Add a unit test. make all check

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: ljin, nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16959
2014-03-18 14:28:30 -07:00
Kai Liu
1ec72b37b1 Several easy-to-add properties related to compaction and flushes
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.

Test Plan: run unit tests and add a new unit test to cover new properties.

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D13677
2014-03-18 14:00:09 -07:00
Igor Canadi
758fa8c359 Don't Finalize in CompactionPicker
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983
2014-03-18 13:59:59 -07:00
Lei Jin
63cef90078 disable the log_number check in Recover()
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923
2014-03-18 12:46:29 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00