Commit Graph

990 Commits

Author SHA1 Message Date
Igor Canadi
df70047669 Flush stale column families
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.

By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers.

Test Plan: Added a test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18345
2014-04-30 14:33:40 -04:00
sdong
7dafa3a1d7 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357
2014-04-30 11:02:26 -07:00
Igor Canadi
66f88c43a5 Some fixes as preparation for release 2014-04-30 09:03:24 -07:00
Igor Canadi
d6d67c0efe More s/us fixes 2014-04-30 07:04:36 -07:00
Yueh-Hsuan Chiang
9d9d2965cb Add a new mem-table representation based on cuckoo hash.
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
  Cuckoo hash uses multiple hash functions.  This allows each key to have multiple
  possible locations in the mem-table.

  - Put: When insert a key, it will try to find whether one of its possible
    locations is vacant and store the key.  If none of its possible
    locations are available, then it will kick out a victim key and
    store at that location.  The kicked-out victim key will then be
    stored at a vacant space of its possible locations or kick-out
    another victim.  In this diff, the kick-out path (known as
    cuckoo-path) is found using BFS, which guarantees to be the shortest.

 - Get: Simply tries all possible locations of a key --- this guarantees
   worst-case constant time complexity.

 - Time complexity: O(1) for Get, and average O(1) for Put if the
   fullness of the mem-table is below 80%.

 - Default using two hash functions, the number of hash functions used
   by the cuckoo-hash may dynamically increase if it fails to find a
   short-enough kick-out path.

 - Currently, HashCuckooRep does not support iteration and snapshots,
   as our current main purpose of this is to optimize point access.

= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
  supports snapshots.  If it returns false, then DB::GetSnapshot() will
  always return nullptr.

Test Plan:
Run existing tests.  Will develop a test specifically for cuckoo hash in
the next diff.

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D16155
2014-04-29 17:13:46 -07:00
Igor Canadi
f1c9aa6ebe More unsigned/signed compare fixes 2014-04-29 13:01:06 -07:00
Igor Canadi
38693d99c4 Fix more signed/unsigned comparsions 2014-04-29 12:40:18 -07:00
Igor Canadi
dd9eb7a7d5 Cache result of ReadFirstRecord()
Summary:
ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().

I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18387
2014-04-29 13:27:58 -04:00
Igor Canadi
91ef2eae23 Use new DBWithTTL API in tests 2014-04-28 23:46:24 -04:00
Igor Canadi
72ff275e3c Fix TransactionLogIterator EOF caching
Summary:
When TransactionLogIterator comes to EOF, it calls UnmarkEOF and continues reading. However, if glibc cached the EOF status of the file, it will get EOF again, even though the new data might have been written to it.

This has been causing errors in Mac OS.

Test Plan: test passes, was failing before

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18381
2014-04-28 23:30:27 -04:00
Donovan Hide
4f9fae9bb7 Add rocksdb_open_for_read_only to C API 2014-04-27 20:57:10 +01:00
Igor Canadi
c489499a2b Fix OSX compile 2014-04-26 17:15:43 -04:00
Lei Jin
ccaca59bee avoid calling FindFile twice in TwoLevelIterator for PlainTable
Summary:
this is to reclaim the regression introduced in
https://reviews.facebook.net/D17853

Test Plan: make all check

Reviewers: igor, haobo, sdong, dhruba, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17985
2014-04-25 12:23:07 -07:00
Lei Jin
d642c60bdc Check PrefixMayMatch on Seek()
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17853
2014-04-25 12:22:23 -07:00
Lei Jin
3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
Igor Canadi
8ce5492623 Delete superversion and log outside of mutex
Summary: As summary. Add two autovectors that get filled up in MakeRoomForWrite and they get deleted outside of mutex

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18249
2014-04-25 14:58:02 -04:00
Igor Canadi
ad3cd39ccd Column family logging
Summary:
Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message"

Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs).

Test Plan: make check + ran db_bench to confirm I'm happy with log output

Reviewers: dhruba, haobo, ljin, yhchiang, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18303
2014-04-25 09:51:16 -04:00
Igor Canadi
4cd9f58c04 Fix corruption test 2014-04-24 14:56:41 -04:00
Igor Canadi
478990c81b Make CompactionInputErrorParanoid less flakey
Summary:
I'm getting lots of e-mails with CompactionInputErrorParanoid failing. Most recent example early morning today was: http://ci-builds.fb.com/job/rocksdb_valgrind/562/consoleFull

I'm putting a stop to these e-mails. I investigated why the test is flakey and it turns out it's because of non-determinsim of compaction scheduling. If there is a compaction after the last flush, CorruptFile will corrupt the compacted file instead of file at level 0 (as it assumes). That makes `Check(9, 9)` fail big time.

I also saw some errors with table file getting outputed to >= 1 levels instead of 0. Also fixed that.

Test Plan: Ran corruption_test 100 times without a failure. Previously it usually failed at 10th occurrence.

Reviewers: dhruba, haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18285
2014-04-24 11:13:28 -07:00
sdong
4de5b84ee0 Fix a bug in IterKey
Summary: IterKey set buffer_size_ to a wrong initial value, causing it to always allocate values from heap instead of stack if the key size is smaller. Fix it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18279
2014-04-23 19:45:22 -07:00
Igor Canadi
f9f8965e96 Print out stack trace in mac, too
Summary: While debugging Mac-only issue with ThreadLocalPtr, this was very useful. Let's print out stack trace in MAC OS, too.

Test Plan: Verified that somewhat useful stack trace was generated on mac. Will run PrintStack() on linux, too.

Reviewers: ljin, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18189
2014-04-23 09:11:35 -04:00
sdong
a570740727 Expose number of entries in mem tables to users
Summary: In this patch, two new DB properties are defined: rocksdb.num-immutable-mem-table and rocksdb.num-entries-imm-mem-tables, from where number of entries in mem tables can be exposed to users

Test Plan:
Cover the codes in db_test
make all check

Reviewers: haobo, ljin, igor

Reviewed By: igor

CC: nkg-, igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18207
2014-04-22 22:13:21 -07:00
Lei Jin
5f1daf7ae3 get rid of shared_ptr in memtable.cc
Summary: Get rid of the devil. Probably won't impact anything on the perf side.

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D18153
2014-04-22 21:14:25 -07:00
sdong
86a0133d05 PlainTableReader to expose index size to users
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.

Test Plan:
Add a unit test.
make all check`

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18195
2014-04-22 19:29:05 -07:00
Igor Canadi
1068d2fa60 Revert "Better port::Mutex::AssertHeld() and AssertNotHeld()"
This reverts commit ddafceb6c2.
2014-04-22 18:38:10 -07:00
Igor Canadi
ddafceb6c2 Better port::Mutex::AssertHeld() and AssertNotHeld()
Summary:
Using ThreadLocalPtr as a flag to determine if a mutex is locked or not enables us to implement AssertNotHeld(). It also makes AssertHeld() actually correct.

I had to remove port::Mutex as a dependency for util/thread_local.h, but that's fine since we can just use std::mutex :)

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong, yhchiang

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18171
2014-04-22 17:26:21 -07:00
Igor Canadi
3992aec8fa Support for column families in TTL DB
Summary:
This will enable people using TTL DB to do so with multiple column families. They can also specify different TTLs for each one.

TODO: Implement CreateColumnFamily() in TTL world.

Test Plan: Added a very simple sanity test.

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, alberts

Differential Revision: https://reviews.facebook.net/D17859
2014-04-22 11:27:33 -07:00
Igor Canadi
8dc34364d2 Rename "benchmark" back to "bench".
Also, make `benchharness.cc` not compiled into rocksdb library.
2014-04-21 13:12:15 -07:00
Pratyush Seth
ff1b5df4c6 Added benchmark functionality on the lines of folly/Benchmark.h
Summary: Added benchmark functionality on the lines of folly/Benchmark.h

Test Plan: Added unit tests

Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17973
2014-04-21 12:29:55 -07:00
Igor Canadi
f813279da5 Remove TransactionLogIteratorRace when -DNDEBUG 2014-04-21 11:08:30 -07:00
Lei Jin
0f2d768191 hints for narrowing down FindFile range and avoiding checking unrelevant L0 files
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.

(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.

Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17205
2014-04-21 09:10:12 -07:00
sdong
651792251a Fix bugs introduced by D17961
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.

Test Plan: make all check

Reviewers: ljin, igor, haobo

Reviewed By: ljin

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17991
2014-04-17 17:25:28 -07:00
sdong
fa430bfd04 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739
2014-04-17 14:14:00 -07:00
Igor Canadi
161d9e586b Don't overflow size_t in mac 2014-04-16 15:15:22 -07:00
Igor Canadi
5c12f27791 Remove tautological assert 2014-04-16 09:09:28 -07:00
Igor Canadi
faf7691358 Close DB at the end of DontRollEmptyLogs test 2014-04-15 17:20:56 -07:00
Igor Canadi
1803ed2ccb Fix Mac OS compile 2014-04-15 16:31:49 -07:00
Igor Canadi
7d838856cf Fix compile issues when doing make release 2014-04-15 16:00:10 -07:00
sdong
0f40fe4bc7 When creating a new DB, fail it when wal_dir contains existing log files
Summary: Current behavior of creating new DB is, if there is existing log files, we will go ahead and replay them on top of empty DB. This is a behavior that no user would expect. With this patch, we will fail the creation if a user creates a DB with existing log files.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17817
2014-04-15 14:01:57 -07:00
Igor Canadi
c166615850 Fix compile issues introduced by RocksDBLite 2014-04-15 13:51:07 -07:00
Igor Canadi
588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi
dbe0f327ca Set log_empty to false even when options.sync is off [fix tests] 2014-04-15 10:28:34 -07:00
Igor Canadi
e6acb874cd Don't roll empty logs
Summary:
With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).

After the diff, we won't create new log file if current is empty.

Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)

Test Plan: Added an unit test. Confirmed that unit test failes in master

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17631
2014-04-15 09:57:25 -07:00
sdong
c87ed0942c Fix db_bench's multireadrandom
Summary: multireadrandom is broken. Fix it

Test Plan: run it and see segfault has gone.

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17781
2014-04-14 15:43:34 -07:00
Yueh-Hsuan Chiang
118f88d25d Fix compile error in tailing_iter.h
Summary:
Fix the following compile error

  ./db/tailing_iter.h:17:1: error: class 'SuperVersion' was previously declared as a struct [-Werror,-Wmismatched-tags]
  class SuperVersion;
  ^
  ./db/column_family.h:77:8: note: previous use is here
  struct SuperVersion {
         ^
         ./db/tailing_iter.h:17:1: note: did you mean struct here?
         class SuperVersion;
         ^~~~~
         struct
         1 error generated.

Test Plan: make

Reviewers: ljin, igor, haobo, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17799
2014-04-14 14:05:15 -07:00
Yueh-Hsuan Chiang
327102efa5 Fix merge_test failure due to incorrect assert behavior in the release mode. 2014-04-14 12:06:49 -07:00
Lei Jin
82b37a18bd thread local for tailing iterator
Summary:
replace the super version acquisision in tailing itrator with thread
local

Test Plan: will post results

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17757
2014-04-14 10:48:01 -07:00
Lei Jin
539dd207df using thread local SuperVersion for NewIterator
Summary:
Similar to GetImp(), use SuperVersion from thread local instead of acquriing mutex.
I don't expect this change will make a dent on NewIterator() performance
because the bottleneck seems to be on the rest part of the API

Test Plan:
make asan_check
will post perf numbers

Reviewers: haobo, igor, sdong, dhruba, yhchiang

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17643
2014-04-14 09:34:59 -07:00
sdong
d5e087b6df db_bench: add a mode to operate multiple DBs
Summary: This patch introduces a new parameter num_multi_db in db_bench. When this parameter is larger than 1, multiple DBs will be created. In all benchmarks, any operation applies to a random DB among them. This is to benchmark the performance of similar applications.

Test Plan: run db_bench on both of num_multi_db=0 and more.

Reviewers: haobo, ljin, igor

Reviewed By: igor

CC: igor, yhchiang, dhruba, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17769
2014-04-11 16:59:08 -07:00
Lei Jin
eba3fc644a make corruption_test:CompactionInputErrorParanoid deterministic
Summary:
it writes ~10M data, default L0 compaction trigger is 4, plus 2 writer
buffer, so that can accommodate ~6M data before compaction happens for
sure. I guess encoding is doing a good job to shrink the data so that
sometime, compaction does not get triggered. I get test failure quite
often.

Test Plan: ran it multiple times and all got pass

Reviewers: igor, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17775
2014-04-11 12:48:38 -07:00
Igor Canadi
de41357a18 Don't dump rocksdb version on IOS 2014-04-11 10:19:58 -07:00
Lei Jin
0af36d6aa6 SeekRandomWhileWriting
Summary: as title

Test Plan: ran it

Reviewers: igor, haobo, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17751
2014-04-11 09:47:20 -07:00
Kai Liu
1405232b6d Temporarily disable a test case in db_test
Summary:

Root cause is still under investigation. Just Disable the troubling use case for now.
2014-04-10 17:17:39 -07:00
Igor Canadi
ddef6841b3 Renamed InfoLogLevel::DEBUG to InfoLogLevel::DEBUG_LEVEL
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.

Test Plan: compiles

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17709
2014-04-10 15:27:42 -07:00
Kai Liu
75b59d5146 Enable hash index for block-based table
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.

Test Plan: Wrote several new unit tests.

Reviewers: sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16539
2014-04-10 14:19:43 -07:00
Lei Jin
7a92537fc4 db_bench: add IteratorCreationWhileWriting mode and allow prefix_seek
Summary: as title

Test Plan: ran it

Reviewers: igor, haobo, yhchiang

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17655
2014-04-10 10:15:59 -07:00
Igor Canadi
4daea66343 Turn on -Wmissing-prototypes
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17649
2014-04-09 21:17:14 -07:00
sdong
df2a8b6a1a Polish IterKey and use it in DBImpl::ProcessKeyValueCompaction()
Summary:
1. Polish IterKey a little bit.
2. Turn to use it in local parameter of current_user_key in DBImpl::ProcessKeyValueCompaction(). Our profile showing that DBImpl::ProcessKeyValueCompaction() has about 14% costs in std::string (the base including reading and writing data but excluding compaction filtering), which is higher than it should be. There are two std::string used in DBImpl::ProcessKeyValueCompaction(), compaction_filter_value and current_user_key and it's hard to distinguish the two.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17613
2014-04-09 20:50:58 -07:00
Igor Canadi
dc55903293 Improved CompressedCache
Summary:
This is testing behavior that was reported in https://github.com/facebook/rocksdb/issues/111

No issue was found, but it still good to commit this and make CompressedCache more robust.

Test Plan: this is a plan

Reviewers: ljin, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17625
2014-04-09 11:43:14 -07:00
Lei Jin
4824014e3b speed up db_bench filluniquerandom mode
Summary:
filluniquerandom is painfully slow due to the naive bitmap check to find
out if a key has been seen before. Majority of time is spent on searching
the last few keys. Split a giant BitSet to smaller ones so that we can
quickly check if a BitSet is full and thus can skip quickly.

It used to take over one hour to filluniquerandom for 100M keys, now it
takes about 10 mins.

Test Plan:
unit test
also verified correctness in db_bench and make sure all keys are
generated

Reviewers: igor, haobo, yhchiang

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17607
2014-04-09 11:25:21 -07:00
Igor Canadi
2014915d32 Fix ASAN issue 2014-04-09 10:38:05 -07:00
Igor Canadi
b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Igor Canadi
731e55c01c Fix GetProperty() test
Summary:
GetProperty test is flakey.
Before this diff: P8635927
After: P8635945

We need to make sure the thread is done before we destruct sleeping tasks. Otherwise, bad things happen.

Test Plan: See summary

Reviewers: ljin, sdong, haobo, dhruba

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17595
2014-04-08 14:57:00 -07:00
Igor Canadi
34455deb06 Fix Mac OS compile issues 2014-04-08 14:05:53 -07:00
Igor Canadi
5b345b76cb Remove env_ from MergingIterator
Summary: env_ is not used. Compiling for iOS complains.

Test Plan: compiles now

Reviewers: ljin, haobo, sdong, dhruba

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17589
2014-04-08 13:40:42 -07:00
Lei Jin
0c1126d4cf db_bench cleanup
Summary:
clean up the db_bench a little bit. also avoid allocating memory for key
in the loop

Test Plan:
I verified a run with filluniquerandom & readrandom. Iterator seek will be used lot
to measure performance. Will fix whatever comes up

Reviewers: haobo, igor, yhchiang

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17559
2014-04-08 11:21:09 -07:00
Igor Canadi
beeee9dccc Small speedup of CompactionFilterV2
Summary: ToString() is expensive. Profiling shows that most compaction threads are stuck in jemalloc, allocating a new string. This will help out a litte.

Test Plan: make check

Reviewers: haobo, danguo

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17583
2014-04-08 11:06:39 -07:00
Lei Jin
92c1eb0291 macros for perf_context
Summary: This will allow us to disable them completely for iOS or for better performance

Test Plan: will run make all check

Reviewers: igor, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17511
2014-04-08 10:58:07 -07:00
sdong
5e2db3b434 PlainTableIterator not to store copied key in std::string
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.

This patch improves iterator performance significantly. Running this benchmark:

./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond

The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.

Test Plan:
Add a unit test.
make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17547
2014-04-07 19:06:09 -07:00
Igor Canadi
664559fe2d Small final fixes before merge 2014-04-07 15:38:53 -07:00
Igor Canadi
d1e2bce42d CallFlushDuringCompaction 2014-04-07 15:03:15 -07:00
Igor Canadi
b42ceb9598 Simplify cleanup of dead (refcount == 0) column families 2014-04-07 14:31:02 -07:00
Igor Canadi
e48348d196 Make flush part of compaction process
This will enable user to use only 1 background thread.
2014-04-07 13:53:08 -07:00
Igor Canadi
2a0917b28e Merge branch 'master' into columnfamilies 2014-04-07 13:04:25 -07:00
Igor Canadi
751e4b1a35 Fix wal_dir sanitizing 2014-04-07 11:36:03 -07:00
Igor Canadi
3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
Igor Canadi
7efdd9ef4d Options::wal_dir shouldn't end in '/'
Summary: If a client specifies wal_dir with trailing '/', we will fail in deleting obsolete log files. See task #4083746

Test Plan: make check

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17535
2014-04-07 10:25:38 -07:00
sdong
ea0198fe9a Create log::Writer out of DB Mutex
Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex.

Test Plan: make all check

Reviewers: haobo, ljin, igor

Reviewed By: haobo

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17487
2014-04-04 15:46:28 -07:00
Lei Jin
c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
sdong
99c756f0fe Flush Buffered Info Logs Before Doing Compaction (one line change)
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.

Test Plan: Should be safe as long as several related db test passes

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17493
2014-04-04 10:58:30 -07:00
Igor Canadi
040657aec9 Fix MacOS errors 2014-04-03 16:04:34 -07:00
Igor Canadi
f76e4027ca initialize candidate count 2014-04-03 11:45:44 -07:00
sdong
b9767d0e09 Move several more logging inside DB mutex to log buffer
Summary: Move several some common logging still in DB mutex to log buffer.

Test Plan: make all check

Reviewers: haobo, igor, ljin, nkg-

Reviewed By: nkg-

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17439
2014-04-03 10:47:18 -07:00
Thomas Adam
98422cba77 [C-API] implemented more options 2014-04-03 10:47:37 +02:00
Thomas Adam
3a30b5b0be [C-API] added "rocksdb_options_set_plain_table_factory" to make it possible to use plain table factory 2014-04-03 10:47:37 +02:00
Haobo Xu
48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
Igor Canadi
d1d19f5db3 Fix valgrind error in c_test 2014-04-02 17:24:30 -07:00
sdong
158845ba9a Move a info logging out of DB Mutex
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17427
2014-04-02 16:48:32 -07:00
sdong
4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
sdong
284c365b77 Fix valgrind error caused by FileMetaData as two level iterator's index block handle
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead

Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17409
2014-04-02 14:38:28 -07:00
Igor Canadi
8555ce2dec Merge branch 'master' into columnfamilies 2014-04-02 10:48:05 -07:00
sdong
e0a87c4cf1 DBIter to use static allocated char array for saved_key_ (if it is not too long)
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: dhruba, igor, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17289
2014-04-01 16:43:11 -07:00
sdong
d50619a559 PlainTableIterator::Seek() shouldn't check bloom filter in total order mode
Summary:
In total order mode, iterator's seek() shouldn't check total order.

Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.

This bug was reported by @igor.

Test Plan: test plain_table_db_test

Reviewers: ljin, haobo, igor

Reviewed By: igor

CC: yhchiang, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17391
2014-04-01 15:05:16 -07:00
Thomas Adam
38dc5ef45f [C-API] added the possiblity to create a HashSkipList or HashLinkedList to support prefix seeks 2014-04-01 12:44:27 +02:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Igor Canadi
577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Igor Canadi
8a139a054c More valgrind issues!
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?

Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: dhruba

CC: leveldb, danguo

Differential Revision: https://reviews.facebook.net/D17337
2014-03-29 10:34:47 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
sdong
2d3468c293 MemTableIterator not to reference Memtable
Summary: In one of the perf, I shows 10%-25% CPU costs of MemTableIterator.Seek(), when doing dynamic prefix seek, are spent on checking whether we need to do bloom filter check or finding out the prefix extractor. Seems that  more level of pointer checking makes CPU cache miss more likely. This patch makes things slightly simpler by copying pointer of bloom of prefix extractor into the iterator.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17247
2014-03-28 16:46:25 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00