Commit Graph

744 Commits

Author SHA1 Message Date
sdong
808928fc99 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357

Conflicts:
	db/plain_table_db_test.cc
	util/options.cc
2014-04-30 14:57:41 -07:00
sdong
eb96dc003a Fix a bug in IterKey
Summary: IterKey set buffer_size_ to a wrong initial value, causing it to always allocate values from heap instead of stack if the key size is smaller. Fix it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18279
2014-04-23 19:45:58 -07:00
sdong
8db376d494 Fix a sign and unsign comparating in plain_table_db_test
Summary:

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2014-04-23 16:09:45 -07:00
sdong
73895c9478 Expose number of entries in mem tables to users
Summary: In this patch, two new DB properties are defined: rocksdb.num-immutable-mem-table and rocksdb.num-entries-imm-mem-tables, from where number of entries in mem tables can be exposed to users

Test Plan:
Cover the codes in db_test
make all check

Reviewers: haobo, ljin, igor

Reviewed By: igor

CC: nkg-, igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18207

Conflicts:
	db/db_test.cc
2014-04-22 22:38:44 -07:00
sdong
7b37f0d5af PlainTableReader to expose index size to users
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.

Test Plan:
Add a unit test.
make all check`

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18195
2014-04-22 19:31:12 -07:00
Igor Canadi
4b7b1949d4 REmove occurrences of kBlockBasedTableWithWholeKeyHashIndex 2014-04-21 12:00:31 -07:00
Igor Canadi
9e04ce7645 Don't execute WholeKeyPrefix test 2014-04-21 11:23:47 -07:00
sdong
034b494774 Fix bugs introduced by D17961
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.

Test Plan: make all check

Reviewers: ljin, igor, haobo

Reviewed By: ljin

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17991

Conflicts:
	db/version_set.cc
2014-04-18 14:40:58 -07:00
sdong
d705755e52 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739

Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/table_cache.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
2014-04-18 14:27:06 -07:00
Kai Liu
22f396798e Enable hash index for block-based table
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.

Test Plan: Wrote several new unit tests.

Reviewers: sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16539

Use shorten index key for hash-index
Summary:
I was wrong about the "index builder", right now since we create index
by scanning both whole table and index, there is not need to preserve
the whole key as the index key.

I switch back to original way index which is both space efficient and
able to supprot in-fly construction of hash index.

IN this patch, I made minimal change since I'm not sure if we still need
the "pluggable index builder", under current circumstance it is of no use
and kind of over-engineered. But I'm not sure if we can still exploit its
usefulness in the future; otherwise I think I can just burn them with great
vengeance.

Test Plan: unit tests

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17745

Conflicts:
	table/block_based_table_reader.cc
	table/block_based_table_reader.h
2014-04-15 20:13:34 -07:00
sdong
258eac1772 Polish IterKey and use it in DBImpl::ProcessKeyValueCompaction()
Summary:
1. Polish IterKey a little bit.
2. Turn to use it in local parameter of current_user_key in DBImpl::ProcessKeyValueCompaction(). Our profile showing that DBImpl::ProcessKeyValueCompaction() has about 14% costs in std::string (the base including reading and writing data but excluding compaction filtering), which is higher than it should be. There are two std::string used in DBImpl::ProcessKeyValueCompaction(), compaction_filter_value and current_user_key and it's hard to distinguish the two.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17613

Conflicts:
	db/db_impl.cc
2014-04-09 20:54:16 -07:00
sdong
b405cb886b PlainTableIterator not to store copied key in std::string
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.

This patch improves iterator performance significantly. Running this benchmark:

./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond

The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.

Test Plan:
Add a unit test.
make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17547
2014-04-07 22:18:09 -07:00
sdong
ea0198fe9a Create log::Writer out of DB Mutex
Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex.

Test Plan: make all check

Reviewers: haobo, ljin, igor

Reviewed By: haobo

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17487
2014-04-04 15:46:28 -07:00
Lei Jin
c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
sdong
99c756f0fe Flush Buffered Info Logs Before Doing Compaction (one line change)
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.

Test Plan: Should be safe as long as several related db test passes

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17493
2014-04-04 10:58:30 -07:00
Igor Canadi
040657aec9 Fix MacOS errors 2014-04-03 16:04:34 -07:00
Igor Canadi
f76e4027ca initialize candidate count 2014-04-03 11:45:44 -07:00
sdong
b9767d0e09 Move several more logging inside DB mutex to log buffer
Summary: Move several some common logging still in DB mutex to log buffer.

Test Plan: make all check

Reviewers: haobo, igor, ljin, nkg-

Reviewed By: nkg-

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17439
2014-04-03 10:47:18 -07:00
Thomas Adam
98422cba77 [C-API] implemented more options 2014-04-03 10:47:37 +02:00
Thomas Adam
3a30b5b0be [C-API] added "rocksdb_options_set_plain_table_factory" to make it possible to use plain table factory 2014-04-03 10:47:37 +02:00
Haobo Xu
48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
Igor Canadi
d1d19f5db3 Fix valgrind error in c_test 2014-04-02 17:24:30 -07:00
sdong
158845ba9a Move a info logging out of DB Mutex
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17427
2014-04-02 16:48:32 -07:00
sdong
4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
sdong
284c365b77 Fix valgrind error caused by FileMetaData as two level iterator's index block handle
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead

Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17409
2014-04-02 14:38:28 -07:00
sdong
e0a87c4cf1 DBIter to use static allocated char array for saved_key_ (if it is not too long)
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: dhruba, igor, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17289
2014-04-01 16:43:11 -07:00
sdong
d50619a559 PlainTableIterator::Seek() shouldn't check bloom filter in total order mode
Summary:
In total order mode, iterator's seek() shouldn't check total order.

Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.

This bug was reported by @igor.

Test Plan: test plain_table_db_test

Reviewers: ljin, haobo, igor

Reviewed By: igor

CC: yhchiang, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17391
2014-04-01 15:05:16 -07:00
Thomas Adam
38dc5ef45f [C-API] added the possiblity to create a HashSkipList or HashLinkedList to support prefix seeks 2014-04-01 12:44:27 +02:00
Igor Canadi
577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Igor Canadi
8a139a054c More valgrind issues!
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?

Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: dhruba

CC: leveldb, danguo

Differential Revision: https://reviews.facebook.net/D17337
2014-03-29 10:34:47 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
sdong
2d3468c293 MemTableIterator not to reference Memtable
Summary: In one of the perf, I shows 10%-25% CPU costs of MemTableIterator.Seek(), when doing dynamic prefix seek, are spent on checking whether we need to do bloom filter check or finding out the prefix extractor. Seems that  more level of pointer checking makes CPU cache miss more likely. This patch makes things slightly simpler by copying pointer of bloom of prefix extractor into the iterator.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17247
2014-03-28 16:46:25 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Yueh-Hsuan Chiang
10cebec79e Fix the bug in MergeUtil which causes mixing values of different keys.
Summary: Fix the bug in MergeUtil which causes mixing values of different keys.

Test Plan:
stringappend_test
make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17235
2014-03-27 16:15:25 -07:00
Haobo Xu
a92194e5b2 [RocksDB] Add db property "rocksdb.cur-size-active-mem-table"
Summary: as title

Test Plan: db_test

Reviewers: sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17217
2014-03-27 15:14:04 -07:00
Igor Canadi
1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
sdong
d556200264 Some small cleaning up to make some compiling environment happy
Summary: Compiler complains some errors when building using our internal build settings. Fix them.

Test Plan: rebuild

Reviewers: haobo, dhruba, igor, yhchiang, ljin

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17199
2014-03-26 18:11:41 -07:00
Igor Canadi
6a08bc042a Fix no return warning in FileComparator 2014-03-26 14:46:07 -07:00
Igor Canadi
1e9621d4e5 Sort files correctly in Builder::SaveTo
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).

In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.

I also added a verification that all files are sorted correctly in CheckConsistency()

NOTE: This diff depends on D17037

Test Plan: make check. Will also run db_stress

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17049
2014-03-26 13:30:14 -07:00
Igor Canadi
954679bb0f AssertHeld() should do things
Summary:
AssertHeld() was a no-op before. Now it does things.

Also, this change caught a bad bug in SuperVersion::Init(). The method is calling db->mutex.AssertHeld(), but db variable is not initialized yet! I also fixed that issue.

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17193
2014-03-26 11:24:52 -07:00
Igor Canadi
ad9a39c9b4 [RocksDB] Preallocate new MANIFEST files
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17163
2014-03-26 09:37:53 -07:00
sdong
6b2e7a2a01 When Options.max_num_files=-1, non level0 files also by pass table cache
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.

Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.

Reviewers: haobo, igor, dhruba, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17001
2014-03-25 18:40:52 -07:00
Yueh-Hsuan Chiang
b9ce156e38 Add assert to MergeOperator::PartialMergeMulti to check # of operands.
Summary:
Add assert(operands_list.size() >= 2) in MergeOperator::PartialMergeMulti
to ensure it's only be called when we have at least two merge operands.

Test Plan: run merge_test and stringappend_test.

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17169
2014-03-25 13:39:17 -07:00
Danny Guo
d9ca83df28 [rocksdb] make init prefix more robust
Summary:
Currently if client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff added a bool to indicate if the prefix
has been intialized. I also added a unit test to cover this case and
make sure the new code path is hit.

Test Plan: db_test

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17151
2014-03-25 11:59:40 -07:00
Yueh-Hsuan Chiang
34f9da1cef Fix the failure of stringappend_test caused by PartialMergeMulti.
Summary:
Fix a bug that PartialMergeMulti will try to merge the first operand
with an empty slice.

Test Plan: run stringappend_test and merge_test.

Reviewers: haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17157
2014-03-25 11:50:09 -07:00
Danny Guo
b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
c21ce14fa5 Fix double-free in corruption_test 2014-03-20 14:37:30 -07:00
Igor Canadi
e67241f0b9 Sanity check on Open
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Yiting Li
7981a43274 Consistency Check Function
Summary: Added a function/command to check the consistency of live files' meta data

Test Plan:
Manual test (size mismatch, file not exist).
Command test script.

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16935
2014-03-20 13:42:45 -07:00