Commit Graph

764 Commits

Author SHA1 Message Date
Stanislau Hlebik
9d70cce047 Adding option to save PlainTable index and bloom filter in SST file.
Summary:
Adding option to save PlainTable index and bloom filter in SST file.
If there is no bloom block and/or index block, PlainTableReader builds
new ones. Otherwise PlainTableReader just use these blocks.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19527
2014-07-18 16:58:13 -07:00
sdong
0abaed2e08 Support multiple DB directories in universal compaction style
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.

Test Plan: Add new unit tests

Reviewers: ljin, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D19869
2014-07-15 12:06:28 -07:00
Igor Canadi
20c056306b Remove stats logger
Summary: Browsing through the code, looks like StatsLogger is not used at all!

Test Plan: compiles

Reviewers: ljin, sdong, yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19827
2014-07-15 09:16:32 -04:00
Igor Canadi
5ff6633588 Fix mac compile 2014-07-10 13:19:47 -07:00
sdong
36de0e5359 Add a function to return current perf level
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.

Test Plan: Add a verification in db_test

Reviewers: yhchiang, igor, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19551
2014-07-10 11:35:48 -07:00
Lei Jin
534357ca3a integrate rate limiter into rocksdb
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.

Test Plan: db_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19425
2014-07-08 12:31:49 -07:00
Lei Jin
5ef1ba7ff5 generic rate limiter
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.

Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19359
2014-07-08 11:41:57 -07:00
Yueh-Hsuan Chiang
90a6aca48e Finer report I/O stats about Flush and Compaction.
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way.  Instead of measuring the size of a file, it
measure I/O cost in per read / write basis.

Test Plan: make all check

Reviewers: sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19383
2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang
d4d338de33 Add timeout_hint_us to WriteOptions and introduce Status::TimeOut.
Summary:
This diff adds timeout_hint_us to WriteOptions.  If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
  waiting for longer than the specified time.  If an abortion happens,
  associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
  will be limited by timeout_hint_us.

The default value of timeout_hint_us is 0 (i.e., OFF.)

The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.

Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test

Reviewers: igor, ljin, haobo, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18837
2014-07-03 15:47:02 -07:00
sdong
2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
Igor Canadi
d3f63f03ad Fix 32-bit errors
Summary: https://www.facebook.com/groups/rocksdb.dev/permalink/590438347721350/

Test Plan: compiles

Reviewers: sdong, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19197
2014-07-02 11:40:16 +02:00
sdong
9c332aa11a HashLinkList memtable switches a bucket to a skip list to reduce performance outliers
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.

Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.

The new data structure is documented in comments in the codes.

Test Plan:
make all check
set threshold_use_skiplist in several tests

Reviewers: yhchiang, haobo, ljin

Reviewed By: yhchiang, ljin

Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19299
2014-07-01 17:14:15 -07:00
Feng Zhu
5656367416 use arena to allocate memtable's bloomfilter and hashskiplist's buckets_
Summary:
    Bloomfilter and hashskiplist's buckets_ allocated by memtable's arena
    DynamicBloom: pass arena via constructor, allocate space in SetTotalBits
    HashSkipListRep: allocate space of buckets_ using arena.
       do not delete it in deconstructor because arena would take care of it.
    Several test files are changed.

Test Plan:
    make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D19335
2014-06-30 15:54:31 -07:00
Yueh-Hsuan Chiang
81c5d98900 Fixed a comparison between signed and unsigned integers in options.cc
Summary:
Fixed the following warning:

util/options.cc: In constructor ‘rocksdb::ColumnFamilyOptions::ColumnFamilyOptions(const rocksdb::Options&)’:
util/options.cc:157:58: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
   if (max_bytes_for_level_multiplier_additional.size() < num_levels) {
                                                             ^

Test Plan: make all check

Reviewers: haobo, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19293
2014-06-25 15:31:30 -06:00
sdong
19de6a7aad Remove MemTableRep::GetIterator(const Slice& slice)
Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes.

Test Plan: make all check

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D19281
2014-06-25 14:09:29 -07:00
Yueh-Hsuan Chiang
55531fd089 Fixed heap-buffer-overflow issue when Options.num_levels > 7.
Summary:
Currently, when num_levels has been changed to > 7, internally
it will not resize max_bytes_for_level_multiplier_additional.
As a result, max_bytes_for_level_multiplier_additional.size() will
be smaller than num_levels, which causes heap-buffer-overflow.

Test Plan: make all check

Reviewers: haobo, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19275
2014-06-25 14:11:12 -06:00
Yueh-Hsuan Chiang
82c31792d0 Revert the default setting of InitFromCmdLineArgs().
Summary:
Revert the default setting of InitFromCmdLineArgs() as all the callers
currently provide full set of arguments.

Test Plan:
make reduce_levels_test
./reduce_levels_test

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19257
2014-06-23 18:51:00 -06:00
Yueh-Hsuan Chiang
e5e6f55bd1 Fix compile error caused in LDB tool
Summary:
Fixed the following compile error.
    tools/reduce_levels_test.cc:89:31: error: no matching function for call to 'InitFromCmdLineArgs'
      LDBCommand* level_reducer = LDBCommand::InitFromCmdLineArgs(args);
                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./util/ldb_cmd.h:56:22: note: candidate function not viable: requires 3 arguments, but 1 was provided
      static LDBCommand* InitFromCmdLineArgs(
                         ^
    ./util/ldb_cmd.h:62:22: note: candidate function not viable: requires 4 arguments, but 1 was provided
      static LDBCommand* InitFromCmdLineArgs(
                         ^
    1 error generated.

Test Plan:
make reduce_levels_test
./reduce_levels_test

Reviewers: haobo, ljin, sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19251
2014-06-23 17:19:53 -06:00
Haobo Xu
dfb31d152d [RocksDB] allow LDB tool to have customized key formatter
Summary: Currently ldb tool dump keys either in ascii format or hex format - neither is ideal if the key has a binary structure and is not readable in ascii. This diff also allows LDB tool to be customized in ways beyond DB options.

Test Plan: verify that key formatter works with some simple db with binary key.

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19209
2014-06-23 15:35:40 -07:00
Igor Canadi
00b26c3a83 JSONDocument
Summary:
After evaluating options for JSON storage, I decided to implement our own. The reason is that we'll be able to optimize it better and we get to reduce unnecessary dependencies (which is what we'd get with folly).

I also plan to write a serializer/deserializer for JSONDocument with our own binary format similar to BSON. That way we'll store binary JSON format in RocksDB instead of the plain-text JSON. This means less storage and faster deserialization.

There are still some inefficiencies left here. I plan to optimize them after we develop a functioning DocumentDB. That way we can move and iterate faster.

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18831
2014-06-20 11:14:14 +02:00
Igor Canadi
d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
sdong
8c265c08f1 HashLinkList to log distribution of number of entries aross buckets
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.

Test Plan: Turn it on in plain_table_db_test and see the logs.

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: leveldb, nkg-, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D19095
2014-06-17 17:55:36 -07:00
Igor Canadi
4f18bfe376 Merge pull request #176 from bgrainger/mutexrw-unlock
Add separate Read/WriteUnlock methods in MutexRW.
2014-06-17 20:38:06 +02:00
Bradley Grainger
2d02ec6533 Add separate Read/WriteUnlock methods in MutexRW.
Some platforms, particularly Windows, do not have a single method that can
release both a held reader lock and a held writer lock; instead, a
separate method (ReleaseSRWLockShared or ReleaseSRWLockExclusive) must be
called in each case.

This may also be necessary to back MutexRW with a shared_mutex in C++14;
the current language proposal includes both an unlock() and a
shared_unlock() method.
2014-06-16 15:41:46 -07:00
Yueh-Hsuan Chiang
4d913cfbc3 Fix a bug causing LOG is not created when max_log_file_size is set.
Summary:
Fix a bug causing LOG is not created when max_log_file_size is set.
This bug is reported in issue #174.

Test Plan:
Add TEST(AutoRollLoggerTest, LogFileExistence).
make auto_roll_logger_test
./auto_roll_logger_test

Reviewers: haobo, sdong, ljin, igor, igor2

Reviewed By: igor2

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19053
2014-06-16 10:27:42 -07:00
Lei Jin
c83b085770 prefetch bloom filter data block for L0 files
Summary: as title

Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18867
2014-06-12 10:06:18 -07:00
sdong
80f409ea37 Clean PlainTableReader's variables for better data locality
Summary:
Clean PlainTableReader's data structures:
(1) inline bloom_ (in order to do this, change DynamicBloom to allow lazy initialization)
(2) remove some variables only used when initialization from the class
(3) put variables not used in normal read code paths to the end of the class and reference prefix_extractor directly
(4) make Options a reference.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18891
2014-06-09 13:53:39 -07:00
Igor Canadi
a0191c9dfe Create Missing Column Families
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490

Test Plan: added unit test. also, stress test for some time

Reviewers: sdong, haobo, dhruba, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18951
2014-06-06 18:04:56 -07:00
Ankit Gupta
08314321fa Merge branch 'master' of https://github.com/facebook/rocksdb 2014-06-05 13:17:53 +01:00
Igor Canadi
552c49f0f4 Remove upper bound for rate limiting unit test 2014-06-03 13:58:44 -07:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
sdong
462796697c dynamic_bloom: replace some divide (remainder) operations with shifts in locality mode, and other improvements
Summary:
This patch changes meaning of options.bloom_locality: 0 means disable cache line optimization and any positive number means use CACHE_LINE_SIZE as block size (the previous behavior is the block size will be CACHE_LINE_SIZE*options.bloom_locality). By doing it, the divide operations inside a block can be replaced by a shift.

Performance is improved:
https://reviews.facebook.net/P471

Also, improve the basic algorithm in two ways:
(1) make sure num of blocks is an odd number
(2) rotate bytes after every probe in locality mode. Since the divider is 2^n, unless doing it, we are never able to use all the bits.
Improvements of false positive: https://reviews.facebook.net/P459

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: dhruba, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D18843
2014-06-02 17:36:38 -07:00
Ankit Gupta
fa1b62cabd Merge remote-tracking branch 'upstream/master' 2014-05-29 18:33:11 -07:00
sdong
9899b12780 ThreadID printed when Thread terminating in the same format as posix_logger
Summary: 220132b65e correctly fixed the issue of thread ID printing when terminating a thread. Nothing wrong with it. This diff prints the ID in the same way as in PosixLogger::logv() so that users can be more easily to correlates them.

Test Plan: run env_test and make sure it prints correctly.

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18819
2014-05-29 11:11:08 -07:00
Ankit Gupta
84cbafa5ce Merge branch 'master' of https://github.com/facebook/rocksdb 2014-05-27 13:04:16 -07:00
Igor Canadi
6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Chilledheart
81b498bc15 Print pthread_t in a more safe way 2014-05-22 01:24:42 +08:00
Ankit Gupta
260842a7f8 Merge branch 'master' of https://github.com/facebook/rocksdb 2014-05-21 07:59:50 -07:00
Mike Orr
591f71285c cleanup exception text 2014-05-21 07:54:22 -04:00
Mike Orr
d788bb8f71 - hdfs cleanup; fix to NewDirectory to comply with definition in env.h
- fix compile error with env_test; static casts added
2014-05-21 07:50:37 -04:00
Mike Orr
c2fda55cfe hdfs cleanup and compile test against CDH 4.4. 2014-05-20 17:22:12 -04:00
sdong
bd1105aa5a Print out thread ID while thread terminates for decreased pool size.
Summary: Per request from @nkg-, temporarily print thread ID when a thread terminates. It is a temp solution as we try to minimized stderr messages.

Test Plan: env_test

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18753
2014-05-19 15:18:02 -07:00
sdong
3df07d1703 ThreadPool to allow decrease number of threads and increase of number of threads is to be instantly scheduled
Summary:
Add a feature to decrease the number of threads in thread pool.
Also instantly schedule more threads if number of threads is increased.

Here is the way it is implemented: each background thread needs its thread ID. After decreasing number of threads, all threads are woken up. The thread with the largest thread ID will terminate. If there are more threads to terminate, the thread will wake up all threads again.

Another change is made so that when number of threads is increased, more threads are created and all previous excessive threads are woken up to do the work.

Test Plan: Add a unit test.

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: yhchiang, igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18675
2014-05-19 11:52:12 -07:00
Ankit Gupta
5a1b41cee1 Merge branch 'master' of https://github.com/facebook/rocksdb 2014-05-18 22:41:00 -07:00
Chilledheart
fa16ef394c Fix errors while building with clang 2014-05-15 12:34:53 +08:00
Igor Canadi
52783c793c declare kInline size in arena.cc 2014-05-14 12:40:49 -07:00
Igor Canadi
a490d2d3a0 Revert "Define kInlineSize in .cc file instead of header"
This reverts commit 35a8873aa6.
2014-05-14 12:36:21 -07:00
Igor Canadi
35a8873aa6 Define kInlineSize in .cc file instead of header 2014-05-14 12:29:46 -07:00
Igor Canadi
eea73226e9 Improve EnvHdfs
Summary: Copy improvements from fbcode's version of EnvHdfs to our open-source version. Some very important bug fixes in there.

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18711
2014-05-14 12:14:18 -07:00
Igor Canadi
f4574449e9 Clean up compaction logging
Summary: Cleaned up compaction logging a little bit. Now file sizes are easier to read. Also, removed the trailing space.

Test Plan:
verified that i'm happy with logging output:

        files_size[#33(seq=101,sz=98KB,0) #31(seq=81,sz=159KB,0) #26(seq=0,sz=637KB,0)]

Reviewers: sdong, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18549
2014-05-14 12:13:50 -07:00
sdong
3e4a9ec241 Arena to inline 2KB of data in it.
Summary:
In order to use arena to a use case that the total allocation size might be small (LogBuffer is already such a case), inline 1KB of data in it, so that it can be mostly in stack or inline in another class.

If always inlining 2KB is a concern, I could make it a template to determine what to inline. However, dependents need to changes. Doesn't go with it for now

Test Plan: make all check.

Reviewers: haobo, igor, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18609
2014-05-14 11:49:01 -07:00
Ankit Gupta
c92425e0ba Merge branch 'master' of https://github.com/facebook/rocksdb 2014-05-13 22:23:04 -07:00
Ankit Gupta
cf0b773a29 Add more options to backupable DB 2014-05-13 22:22:21 -07:00
sdong
8c2c4602ee FixedPrefixTransform to include prefix length in its name
Summary: As title

Test Plan: make all check.

Reviewers: haobo, igor, yhchiang

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18705
2014-05-13 16:08:21 -07:00
Igor Canadi
26f5dd9a5a TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.

Here's description of task #4296714:
       I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.

       For example, this table has 3155 entries and 1391828 deleted keys.

The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.

Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.

In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.

Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests

Reviewers: sdong, dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18579
2014-05-13 12:30:55 -07:00
Igor Canadi
d08073aafc Merge pull request #141 from dallasmarlow/master
only use MAP_HUGETLB when the kernel supports it
2014-05-11 15:44:32 -07:00
Dallas Marlow
557fbc9b3b arena spacing 2014-05-11 10:22:28 -04:00
Igor Canadi
038a477b53 Make it easier to start using RocksDB
Summary:
This diff is addressing multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md

Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/

I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to optimize rocksdb for less stalls with level style compaction. This is very likely not ideal configuration. Feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.

Test Plan: compiled rocksdb. ran examples.

Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18621
2014-05-10 10:49:33 -07:00
Dallas Marlow
030db3d17e testing 2014-05-09 18:58:39 -04:00
Tyler Neely
deb89401fd have proprocessor choose correct mmap args 2014-05-09 14:49:25 -04:00
Igor Canadi
fec4269966 Fix more gflag namespace issues 2014-05-09 08:41:02 -07:00
Igor Canadi
af7453a673 autovector::resize
Summary: Resize the autovector!

Test Plan: test

Reviewers: sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18543
2014-05-08 13:50:49 -07:00
sdong
1c6a027d23 HashLinkedList::Iterator: remove an ununsed class variable
Summary: This variable is not used. Remove it.

Test Plan: build.

Reviewers: haobo, igor, yhchiang

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18525
2014-05-08 13:28:52 -07:00
dallas marlow
f41cde3105 remove anon mmap allocation flag MAP_HUGETLB 2014-05-08 11:45:44 -04:00
sdong
3a171dcb51 Pass logger to memtable rep and TLB page allocation error logged to info logs
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.

Test Plan: make all check

Reviewers: haobo, igor, yhchiang

Reviewed By: yhchiang

CC: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18471
2014-05-05 16:43:37 -07:00
Igor Canadi
d2569fea47 log_and_apply_bench on a new benchmark framework
Summary:
db_test includes Benchmark for LogAndApply. This diff removes it from db_test and puts it into a separate log_and_apply bench. I just wanted to play around with our new benchmark framework and figure out how it works.

I would also like to show you how great it is! I believe right set of microbenchmarks can speed up our productivity a lot and help catch early regressions.

Test Plan: no

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18261
2014-05-05 11:11:48 -07:00
Igor Canadi
6785a52b1b Temporary remove perror() calls before we can log from inside of arena 2014-05-05 07:13:48 -07:00
sdong
4a7c747064 Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB""
And make the default 0 for hash linked list memtable

This reverts commit d69dc64be7.
2014-05-04 13:56:29 -07:00
Igor Canadi
d69dc64be7 Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"
This reverts commit 7dafa3a1d7.
2014-05-04 08:37:09 -07:00
Igor Canadi
4ecfbcf865 ApplyToAllCacheEntries
Summary: Added a method that executes a callback on every cache entry.

Test Plan: added a unit test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D18441
2014-05-02 16:24:04 -04:00
Igor Canadi
0afc8bc29a xxHash
Summary:
Originally: https://github.com/facebook/rocksdb/pull/87/files

I'm taking over to apply some finishing touches

Test Plan: will add tests

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18315
2014-05-01 14:09:32 -04:00
Igor Canadi
df70047669 Flush stale column families
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.

By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers.

Test Plan: Added a test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18345
2014-04-30 14:33:40 -04:00
sdong
7dafa3a1d7 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357
2014-04-30 11:02:26 -07:00
Igor Canadi
66f88c43a5 Some fixes as preparation for release 2014-04-30 09:03:24 -07:00
Yueh-Hsuan Chiang
9d9d2965cb Add a new mem-table representation based on cuckoo hash.
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
  Cuckoo hash uses multiple hash functions.  This allows each key to have multiple
  possible locations in the mem-table.

  - Put: When insert a key, it will try to find whether one of its possible
    locations is vacant and store the key.  If none of its possible
    locations are available, then it will kick out a victim key and
    store at that location.  The kicked-out victim key will then be
    stored at a vacant space of its possible locations or kick-out
    another victim.  In this diff, the kick-out path (known as
    cuckoo-path) is found using BFS, which guarantees to be the shortest.

 - Get: Simply tries all possible locations of a key --- this guarantees
   worst-case constant time complexity.

 - Time complexity: O(1) for Get, and average O(1) for Put if the
   fullness of the mem-table is below 80%.

 - Default using two hash functions, the number of hash functions used
   by the cuckoo-hash may dynamically increase if it fails to find a
   short-enough kick-out path.

 - Currently, HashCuckooRep does not support iteration and snapshots,
   as our current main purpose of this is to optimize point access.

= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
  supports snapshots.  If it returns false, then DB::GetSnapshot() will
  always return nullptr.

Test Plan:
Run existing tests.  Will develop a test specifically for cuckoo hash in
the next diff.

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D16155
2014-04-29 17:13:46 -07:00
Igor Canadi
38693d99c4 Fix more signed/unsigned comparsions 2014-04-29 12:40:18 -07:00
Igor Canadi
d0b006ab13 Fix unsigned and signed integer compare 2014-04-29 12:29:22 -07:00
Igor Canadi
f868dcbbed Support for adding TTL-ed column family
Summary:
This enables user to add a TTL column family to normal DB.

Next step should be to expand StackableDB and create StackableColumnFamily, such that users can for example add geo-spatial column families to normal DB.

Test Plan: added a test

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18201
2014-04-28 20:34:20 -07:00
Igor Canadi
72ff275e3c Fix TransactionLogIterator EOF caching
Summary:
When TransactionLogIterator comes to EOF, it calls UnmarkEOF and continues reading. However, if glibc cached the EOF status of the file, it will get EOF again, even though the new data might have been written to it.

This has been causing errors in Mac OS.

Test Plan: test passes, was failing before

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18381
2014-04-28 23:30:27 -04:00
Lei Jin
3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
Igor Canadi
46b3076c91 Better stack trace in MAC
Summary:
Now this gives us the real deal stack trace:

    Assertion failed: (false), function GetProperty, file db/db_impl.cc, line 4072.
    Received signal 6 (Abort trap: 6)
    #0   0x7fff57ce39b9
    #1   abort (in libsystem_c.dylib) + 125
    #2   basename (in libsystem_c.dylib) + 0
    #3   rocksdb::DBImpl::GetProperty(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) (in db_test) (db_impl.cc:4072)
    #4   rocksdb::_Test_Empty::_Run() (in db_test) (testharness.h:68)
    #5   rocksdb::_Test_Empty::_RunIt() (in db_test) (db_test.cc:1005)
    #6   rocksdb::test::RunAllTests() (in db_test) (testharness.cc:60)
    #7   main (in db_test) (db_test.cc:6697)
    #8   start (in libdyld.dylib) + 1

Test Plan: added artificial assert, saw great stack trace

Reviewers: haobo, dhruba, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18309
2014-04-25 09:50:51 -04:00
Igor Canadi
fc3127e8de Install stack trace handlers in unit tests
Summary: Sometimes, our tests fail because of normal `assert` call. It would be helpful to see stack trace in that case, too.

Test Plan: Added `assert(false)` and verified it prints out stack trace

Reviewers: haobo, dhruba, sdong, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18291
2014-04-24 13:52:20 -04:00
Igor Canadi
f9f8965e96 Print out stack trace in mac, too
Summary: While debugging Mac-only issue with ThreadLocalPtr, this was very useful. Let's print out stack trace in MAC OS, too.

Test Plan: Verified that somewhat useful stack trace was generated on mac. Will run PrintStack() on linux, too.

Reviewers: ljin, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18189
2014-04-23 09:11:35 -04:00
Lei Jin
547bb6a626 simplify ThreadLocalPtr a little bit
Summary:
make singleton a static member instead of dynamic object. This should
also avoid the race on unique_ptr

Test Plan: make all check

Reviewers: igor, haobo, sdong

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18177
2014-04-22 21:13:34 -07:00
Igor Canadi
1068d2fa60 Revert "Better port::Mutex::AssertHeld() and AssertNotHeld()"
This reverts commit ddafceb6c2.
2014-04-22 18:38:10 -07:00
Igor Canadi
ddafceb6c2 Better port::Mutex::AssertHeld() and AssertNotHeld()
Summary:
Using ThreadLocalPtr as a flag to determine if a mutex is locked or not enables us to implement AssertNotHeld(). It also makes AssertHeld() actually correct.

I had to remove port::Mutex as a dependency for util/thread_local.h, but that's fine since we can just use std::mutex :)

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong, yhchiang

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18171
2014-04-22 17:26:21 -07:00
Igor Canadi
c2da9e5997 Flush before Fsync()/Sync()
Summary: Calling Fsync()/Sync() on a file should give the guarantee that whatever you written to the file is now persisted. This is currently not the case, since we might have some data left in application cache as we do Fsync()/Sync(). For example, BuildTable() calls Fsync() without the flush, assuming all sst data is now persisted, but it's actually not. This may result in big inconsistencies.

Test Plan: no test

Reviewers: sdong, dhruba, haobo, ljin, yhchiang

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18159
2014-04-21 17:45:04 -07:00
Igor Canadi
ba16c1f410 Move benchmark timing to Env::NowNanos() 2014-04-21 17:43:51 -07:00
Igor Canadi
8dc34364d2 Rename "benchmark" back to "bench".
Also, make `benchharness.cc` not compiled into rocksdb library.
2014-04-21 13:12:15 -07:00
Igor Canadi
05c168658e Relax env_test::AllocateTest 2014-04-21 12:56:32 -07:00
Pratyush Seth
ff1b5df4c6 Added benchmark functionality on the lines of folly/Benchmark.h
Summary: Added benchmark functionality on the lines of folly/Benchmark.h

Test Plan: Added unit tests

Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17973
2014-04-21 12:29:55 -07:00
Igor Canadi
c7076a7a05 Fix Allocate test
Summary: For some reason, on a subset of our continuous build machines, preallocation is allocating 8 block more than it should be. Let's relax the test a little bit -- now we require the test to allocate *at least* the number of blocks as we told them to.

Test Plan: no

Reviewers: ljin, haobo, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18141
2014-04-21 12:12:02 -07:00
Igor Canadi
86ae8203e6 Fix ifdef NDEBUG 2014-04-17 14:29:28 -07:00
Igor Canadi
62551b1c4e Don't compile sync_point if NDEBUG
Summary:
We don't really need sync_point.o if we're compiling with NDEBUG.

This diff depends on D17823

Test Plan: compiles

Reviewers: haobo, ljin, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17829
2014-04-17 10:49:58 -07:00
Igor Canadi
588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi
23c8f89b57 Revert "Don't compile ldb tool into static library"
This reverts commit e296577ef6.
2014-04-15 11:29:02 -07:00
Igor Canadi
e296577ef6 Don't compile ldb tool into static library
Summary:
This is first step of my effort to reduce size of librocksdb.a for use in mobile.

ldb object files are huge and are ment to be used as a command line tool. I moved them to `tools/` directory and include them only when compiling `ldb`

This diff reduced librocksdb.a from 42MB to 39MB on my mac (not stripped).

Test Plan: ran ldb

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17823
2014-04-15 10:52:39 -07:00
Igor Canadi
9433e359d1 Dont build version if compiling for IOS 2014-04-11 10:54:47 -07:00
Igor Canadi
6c0fc5d3ba Fix some InfoLogLevels 2014-04-10 16:49:49 -07:00
Igor Canadi
ddef6841b3 Renamed InfoLogLevel::DEBUG to InfoLogLevel::DEBUG_LEVEL
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.

Test Plan: compiles

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17709
2014-04-10 15:27:42 -07:00
Igor Canadi
4daea66343 Turn on -Wmissing-prototypes
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17649
2014-04-09 21:17:14 -07:00
Lei Jin
4824014e3b speed up db_bench filluniquerandom mode
Summary:
filluniquerandom is painfully slow due to the naive bitmap check to find
out if a key has been seen before. Majority of time is spent on searching
the last few keys. Split a giant BitSet to smaller ones so that we can
quickly check if a BitSet is full and thus can skip quickly.

It used to take over one hour to filluniquerandom for 100M keys, now it
takes about 10 mins.

Test Plan:
unit test
also verified correctness in db_bench and make sure all keys are
generated

Reviewers: igor, haobo, yhchiang

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17607
2014-04-09 11:25:21 -07:00
Igor Canadi
b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Lei Jin
92c1eb0291 macros for perf_context
Summary: This will allow us to disable them completely for iOS or for better performance

Test Plan: will run make all check

Reviewers: igor, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17511
2014-04-08 10:58:07 -07:00
Igor Canadi
a4d73ddee0 Initialize Options::bloom_locality
Summary: I think this issue was caused by bad merge. We have to initialize bloom_locality, otherwise valgrind complains: "Use of uninitialised value of size 8"

Test Plan: Run valgrind ./prefix_test

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17553
2014-04-07 16:33:03 -07:00
Igor Canadi
3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
Lei Jin
c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
Igor Canadi
318eace49d Dynamically choose SSE 4.2
Summary: Otherwise, if we compile on machine with SSE4.2 support and run it on machine without the support, we will fail.

Test Plan: compiles, verified that isSse42() gets called.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17505
2014-04-04 14:03:19 -07:00
Igor Canadi
51023c3911 Make RocksDB compile for iOS
Summary:
I had to make number of changes to the code and Makefile:
* Add `make lib`, that will create static library without debug info. We need this to avoid growing binary too much. Currently it's 14MB.
* Remove cpuinfo() function and use __SSE4_2__ macro. We actually used the macro as part of Fast_CRC32() function.
As a result, I also accidentally fixed this issue: https://www.facebook.com/groups/rocksdb.dev/permalink/549700778461774/?stream_ref=2
* Remove __thread locals in OS_MACOSX

Test Plan: `make lib PLATFORM=IOS`

Reviewers: ljin, haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17475
2014-04-04 13:11:44 -07:00
sdong
ef7dc38919 Fix some other signed & unsigned comparisons
Summary: Fix some signed and unsigned comparisons to make some other build script happy.

Test Plan: Build and run those changed tests

Reviewers: ljin, igor, haobo

Reviewed By: igor

CC: yhchiang, dhruba, kailiu, leveldb

Differential Revision: https://reviews.facebook.net/D17463
2014-04-03 20:51:27 -07:00
Igor Canadi
040657aec9 Fix MacOS errors 2014-04-03 16:04:34 -07:00
Haobo Xu
2fa5d41ba3 [RocksDB] make SetPerfLevel affect only the current thread
Summary: as title, make it easy to turn on/off profiling at per thread level.

Test Plan: make check

Reviewers: sdong, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17469
2014-04-03 15:09:17 -07:00
Haobo Xu
48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
Albert Strasheim
56ca75e89e crc32: build a whole special Extend function for SSE 4.2.
Disassembling the Extend function shows something that looks
much more healthy now. The SSE 4.2 instructions are right
there in the body of the function.

Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz

Before:

crc32c: 1.305 micros/op 766260 ops/sec; 2993.2 MB/s (4K per op)

After:

crc32c: 0.442 micros/op 2263843 ops/sec; 8843.1 MB/s (4K per op)
2014-04-02 15:15:57 -07:00
Igor Canadi
8555ce2dec Merge branch 'master' into columnfamilies 2014-04-02 10:48:05 -07:00
Lei Jin
807b2c2e02 reduce thread count in ThreadLocalTest.ConcurrentReadWriteTest
Summary: to make it less CPU intensive

Test Plan: ran it

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17403
2014-04-01 16:26:07 -07:00
Yueh-Hsuan Chiang
fa84eb1f7b Fixed a compile error which tries to check whether a size_t < 0 in env_posix.cc
Summary:
Fixed a compile error which tries to check whether a size_t < 0 in env_posix.cc

util/env_posix.cc:180:16: error: comparison of unsigned expression < 0 is always false [-Werror,-Wtautological-compare]
    } while (r < 0 && errno == EINTR);
             ~ ^ ~
1 error generated.

Test Plan: make check all

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17379
2014-04-01 11:09:06 -07:00
Igor Canadi
8e81caf01a Fix Autoroll logger
Summary:
If auto roll logger can't create a new LOG file on roll (if, for example, somebody deletes rocksdb directory while rocksdb is running, khm), we'll try to call Logv on invalid address and get a SIGSEGV. This diff will fix the issue

Here's the paste of the stack trace: https://phabricator.fb.com/P8276386 (fb-only)

Test Plan: make check is fine, although not really testing error condition

Reviewers: haobo, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17367
2014-03-31 17:18:06 -07:00
Igor Canadi
726c8084cd Retry FS system calls on EINTR
Summary: EINTR means 'please retry'. We don't do that currenty. We should.

Test Plan: make check, although it doesn't really test the new code. we'll just have to believe in the code!

Reviewers: haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17349
2014-03-31 14:45:26 -07:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Lei Jin
550cca7192 dynamicbloom fix: don't offset address when it is already aligned
Summary: this causes overflow and asan failure

Test Plan: make asan_check

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17301
2014-03-28 17:30:20 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
Lei Jin
c8bb79978e fix the buffer overflow in dynamic_bloom_test
Summary: int -> uint64_t

Test Plan:
it think it is pretty obvious
will run asan_check before committing

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17241
2014-03-28 16:21:42 -07:00
Igor Canadi
64ae6e9eb9 Don't preallocate log files 2014-03-28 15:04:43 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Igor Canadi
b14c1f995b allow mmap writes 2014-03-27 12:00:38 -07:00
Igor Canadi
5826f9528f Make rate limiting unit test more robust 2014-03-27 11:53:05 -07:00
Igor Canadi
1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
sdong
d556200264 Some small cleaning up to make some compiling environment happy
Summary: Compiler complains some errors when building using our internal build settings. Fix them.

Test Plan: rebuild

Reviewers: haobo, dhruba, igor, yhchiang, ljin

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17199
2014-03-26 18:11:41 -07:00
Igor Canadi
e86d7dffd7 Merge branch 'master' into columnfamilies 2014-03-25 15:24:02 -07:00
Igor Canadi
5c44a8db61 fallocate_with_keep_size is false for LogWrites 2014-03-25 12:53:23 -07:00
Igor Canadi
e8168382c4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-25 11:09:40 -07:00
Danny Guo
b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
275832aea8 Merge branch 'master' into columnfamilies 2014-03-21 16:51:26 -07:00
sdong
83ab62e2bb Fix data corruption by LogBuffer
Summary: LogBuffer::AddLogToBuffer() uses vsnprintf() in the wrong way, which might cause buffer overflow when log line is too line. Fix it.

Test Plan: Add a unit test to cover most LogBuffer's most logic.

Reviewers: igor, haobo, dhruba

Reviewed By: igor

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17103
2014-03-21 15:32:48 -07:00
Igor Canadi
ac328a86b9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
2014-03-20 14:41:37 -07:00
Igor Canadi
e67241f0b9 Sanity check on Open
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Yiting Li
7981a43274 Consistency Check Function
Summary: Added a function/command to check the consistency of live files' meta data

Test Plan:
Manual test (size mismatch, file not exist).
Command test script.

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16935
2014-03-20 13:42:45 -07:00
Igor Canadi
e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
Igor Canadi
22507aff6c Fix compile issue in Mac OS
Summary:
Compile issues are:
* Unused variable env_
* Unused fallocate_with_keep_size_

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17043
2014-03-19 15:40:12 -07:00
Igor Canadi
3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi
e0c1211555 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	tools/db_stress.cc
2014-03-17 12:21:05 -07:00
Caio SBA
f234dfd8fb Breaking line 2014-03-14 23:56:58 +00:00
Caio SBA
b9c78d2db6 Make it compile on Debian/GCC 4.7 2014-03-14 22:44:35 +00:00
Igor Canadi
8b169e949a Merge branch 'master' into columnfamilies 2014-03-14 13:42:36 -07:00
Igor Canadi
56dce9bf8e unterminated conditional directive 2014-03-14 10:22:37 -07:00
Igor Canadi
f74659ac9f Fix another Mac OS warning 2014-03-14 10:13:39 -07:00
Igor Canadi
3c75cc15a9 Fix HashSkipList and HashLinkedList SIGSEGV
Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.

Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.

Test Plan: SIGSEGV and assertion-throwing test now works!

Reviewers: ljin, haobo, sdong

Reviewed By: sdong

CC: leveldb, ljin

Differential Revision: https://reviews.facebook.net/D16857
2014-03-14 10:02:04 -07:00
Igor Canadi
e1f56e12cf Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	tools/db_stress.cc
2014-03-13 13:21:20 -07:00
Kai Liu
11da8bc5df A heuristic way to check if a memtable is full
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.

Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051
2014-03-12 16:40:14 -07:00
Igor Canadi
dff9214165 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	tools/db_stress.cc
2014-03-12 10:17:41 -07:00
sdong
bd45633b71 Fix data race against logging data structure because of LogBuffer
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.

With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.

Thanks @igor for this finding!

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: dhruba, ljin, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D16767
2014-03-11 16:09:53 -07:00
sdong
01dcef114b Env to add a function to allow users to query waiting queue length
Summary: Add a function to Env so that users can query the waiting queue length of each thread pool

Test Plan: add a test in env_test

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, igor, yhchiang, ljin, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D16755
2014-03-11 10:19:02 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Lei Jin
8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Haobo Xu
66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
Igor Canadi
04d2c26e17 Add option verify_checksums_in_compaction
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695
2014-03-10 10:06:34 -07:00
Igor Canadi
1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Lei Jin
e5fa4944fc use CAS when returning SuperVersion to ThreadLocal
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641
2014-03-07 14:43:22 -08:00
Yumikiyo Osanai
056a0286d2 Modify the compile error about ftruncate()
Summary:
Change to store the return value from ftruncate().
The reason is that ftruncate() has "warn_unused_result" attribute in some environment.

Signed-off-by: Yumikiyo Osanai <yumios.art@gmail.com>
2014-03-08 02:14:34 +09:00
sdong
e1f52b6a22 Fix Valgrind error introduced by D16515
Summary: valgrind reports issues. This patch seems to fix it.

Test Plan: run the tests that fails in valgrind

Reviewers: igor, haobo, kailiu

Reviewed By: kailiu

CC: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16653
2014-03-06 16:07:37 -08:00
Igor Canadi
26ac5603f4 Truncate unused space on PosixWritableFile::Close()
Summary:
Blocks allocated with fallocate will take extra space on disk even if they are unused and the file is close.

Now we remove the extra blocks at the end of the file by calling `ftruncate`.

Test Plan: added a test to env_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16647
2014-03-06 15:59:27 -08:00
Igor Canadi
80a207fc90 Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/compaction_picker.h
	db/db_impl.cc
	db/version_set.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-05 16:59:22 -08:00
Kai Liu
abeee9f2cb Make sure GetUniqueID releated tests run on "regular" storage
Summary:
With the use of tmpfs or ramfs, unit tests related to GetUniqueID()
failed because of the failure from ioctl, which doesn't work with these
fancy file systems at all.

I fixed this issue and make sure all related tests run on the "regular"
storage (disk or flash).

Test Plan: TEST_TMPDIR=/dev/shm make check -j32

Reviewers: igor, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16593
2014-03-05 16:21:36 -08:00
sdong
ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
sdong
4405f3a000 Allow user to specify log level for info_log
Summary:
Currently, there is no easy way for user to change log level of info log. Add a parameter in options to specify that.
Also make the default level to INFO level. Removing the [INFO] tag if it is INFO level as I don't want to cause performance regression. (add [LOG] means another mem-copy and string formatting).

Test Plan:
make all check
manual check the levels work as expected.

Reviewers: dhruba, yhchiang

Reviewed By: yhchiang

CC: dhruba, igor, i.am.jin.lei, ljin, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D16563
2014-03-05 14:54:31 -08:00
Igor Canadi
0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
Lei Jin
04298f8c33 output perf_context in db_bench readrandom
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575
2014-03-05 10:32:54 -08:00
Igor Canadi
fa34697237 Merge branch 'master' into columnfamilies 2014-03-04 09:39:14 -08:00
kailiu
906f3dca72 Add a hash-index component for block
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245
2014-03-03 21:11:49 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
Igor Canadi
58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Kai Liu
6ba1084f24 Fix some compilation bugs in different platforms
Summary:

detect some problems when testing my 3rd party release tool.
2014-02-27 22:20:17 -08:00
Kai Liu
99e4b40a55 Fix the [-Werror=sign-compare] issues
Summary:

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2014-02-27 22:18:33 -08:00
Igor Canadi
492c9f71c6 [CF] Column family support for LDB tool
Summary: Added list_column_family command and also updated dump_manifest

Test Plan: no

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16419
2014-02-27 16:39:23 -08:00
Yueh-Hsuan Chiang
9a7b74954f Refine the checks in InfoLogLevel test.
Summary:
InfoLogLevel test now checks the number of lines of the output log file
instead of the number of bytes in the log file.

This diff fixes the issue that the previous InfoLogLevel test in
auto_roll_logger_test passed in make check but fails when valgrind
is used.

Test Plan: run with make check and valgrind.

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16407
2014-02-27 14:00:10 -08:00
Lei Jin
ad0c3747cb cache SuperVersion in thread local storage to avoid mutex lock
Summary: as title

Test Plan:
asan_check
will post results later

Reviewers: haobo, igor, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16257
2014-02-27 11:38:55 -08:00
kailiu
e41c060a06 Make sure logger is safely released in InfoLogLevel
Summary: fix the memory leak that was captured by jenkin build.

Test Plan: ran the valgrind test locally

Reviewers: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16389
2014-02-26 19:07:57 -08:00
Yueh-Hsuan Chiang
ccaedd16d4 Enable log info with different levels.
Summary:
* Now each Log related function has a variant that takes an additional
  argument indicating its log level, which is one of the following:
 - DEBUG, INFO, WARN, ERROR, FATAL.

* To ensure backward-compatibility, old version Log functions are kept
  unchanged.

* Logger now has a member variable indicating its log level.  Any incoming
  Log request which log level is lower than Logger's log level will not
  be output.

* The output of the newer version Log will be prefixed by its log level.

Test Plan:
Add a LogType test in auto_roll_logger_test.cc

 = Sample log output =
    2014/02/11-00:03:07.683895 7feded179840 [DEBUG] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683898 7feded179840 [INFO] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683900 7feded179840 [WARN] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683903 7feded179840 [ERROR] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683906 7feded179840 [FATAL] this is the message to be written to the log file!!

Reviewers: dhruba, xjin, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16071
2014-02-26 14:41:28 -08:00
Igor Canadi
944ff673d6 Merge branch 'master' into columnfamilies 2014-02-26 10:09:52 -08:00
Lei Jin
b2795b799e thread local pointer storage
Summary:
This is not a generic thread local implementation in the sense that it
only takes pointer. But it does support multiple instances per thread
and lets user plugin function to perform cleanup when thread exits or an
instance gets destroyed.

Test Plan: unit test for now

Reviewers: haobo, igor, sdong, dhruba

Reviewed By: igor

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D16131
2014-02-25 17:47:37 -08:00
Igor Canadi
d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00
sdong
01c27be5fb A simple benchmark to measure WAL append latency
Summary: A simple benchmark that simulates WAL append. It can be used to test different platform/file system's performance on WAL.

Test Plan: run it.

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16239
2014-02-24 14:39:32 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
Lei Jin
994c327b86 IOError cleanup
Summary: Clean up IOErrors so that it only indicates errors talking to device.

Test Plan: make all check

Reviewers: igor, haobo, dhruba, emayanke

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15831
2014-02-12 11:42:54 -08:00
Siying Dong
33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
Igor Canadi
99e61fdd5c [CF] Separate dumping of DBOptions and ColumnFamilyOptions
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.

Test Plan: make check, confirm contents of the LOG

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16011
2014-02-07 13:52:51 -08:00
Igor Canadi
2b8c44639a Merge branch 'master' into columnfamilies 2014-02-07 11:42:56 -08:00
Igor Canadi
d53b188228 Fix some errors detected by coverity scan
Summary: Nothing major, just an extra return line and posibility of leaking fb in NewRandomRWFile

Test Plan: make check

Reviewers: kailiu, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15993
2014-02-06 21:59:44 -08:00
Igor Canadi
0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi
0b4ccf765c Flushes should always go to HIGH priority thread pool
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).

It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15987
2014-02-06 14:17:13 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Igor Canadi
f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi
328ac7ee02 Merge branch 'master' into columnfamilies 2014-02-05 12:00:33 -08:00
Igor Canadi
c24d8c4e90 [CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.

To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.

This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 11:55:30 -08:00
Albert Strasheim
d411dc5884 crc32c: choose function in static initialization
Before: 4.430 micros/op 225732 ops/sec; 881.8 MB/s (4K per op)
After: 4.125 micros/op 242425 ops/sec; 947.0 MB/s (4K per op)
2014-02-04 19:13:57 -08:00
Igor Canadi
73f62255c1 [CF] Split SanitizeOptions into two
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)

I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.

Test Plan: make check doesn't complain

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15873
2014-02-04 17:26:51 -08:00
Igor Canadi
2966d764cd Fix some 32-bit compile errors
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697

Test Plan: RocksDB still compiles on 64-bit :)

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15825
2014-02-03 13:48:30 -08:00
Igor Canadi
2a9271b403 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
2014-02-03 13:47:54 -08:00
Igor Canadi
29bacb2eb6 VersionSet cleanup
Summary:
Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance)
Unfriended VersionSet and ColumnFamilyData (yay!)
Removed VersionSet::NumberLevels()
Cleaned up DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15819
2014-02-03 13:10:47 -08:00
Siying Dong
d169b67680 [Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte.
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489
2014-02-03 12:19:30 -08:00
Igor Canadi
5c6ef56152 Fix printf format 2014-02-03 10:25:37 -08:00
Kai Liu
87bda51d77 Merge pull request #58 from mlin/no-stdout
Eliminate stdout message when launching a posix thread.
2014-02-03 00:38:11 -08:00
kailiu
4f6cb17bdb First phase API clean up
Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525
2014-02-03 00:30:43 -08:00
Igor Canadi
5661ed8b80 Fix reduce_levels_test 2014-01-31 19:44:48 -08:00
Igor Canadi
f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Igor Canadi
c37e7de669 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-01-30 11:47:23 -08:00
Igor Canadi
ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
Igor Canadi
fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Lei Jin
fb84c49a36 convert Tickers back to array with padding and alignment
Summary:
Pad each Ticker structure to be 64 bytes and make them align on 64 bytes
boundary to avoid cache line false sharing issue.
Please refer to task 3615553 for more details

Test Plan:
db_bench

LevelDB:    version 2.0s
Date:       Wed Jan 29 12:23:17 2014
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
rocksdb.build.overwrite.qps 49638
rocksdb.build.overwrite.p50_micros 58.73
rocksdb.build.overwrite.p75_micros 210.56
rocksdb.build.overwrite.p99_micros 733.28
rocksdb.build.fillseq.qps 366729
rocksdb.build.fillseq.p50_micros 1.00
rocksdb.build.fillseq.p75_micros 1.00
rocksdb.build.fillseq.p99_micros 2.65
rocksdb.build.readrandom.qps 1152995
rocksdb.build.readrandom.p50_micros 11.27
rocksdb.build.readrandom.p75_micros 15.69
rocksdb.build.readrandom.p99_micros 33.59
rocksdb.build.readrandom_smallblockcache.qps 956047
rocksdb.build.readrandom_smallblockcache.p50_micros 15.23
rocksdb.build.readrandom_smallblockcache.p75_micros 17.31
rocksdb.build.readrandom_smallblockcache.p99_micros 31.49
rocksdb.build.readrandom_memtable_sst.qps 1105183
rocksdb.build.readrandom_memtable_sst.p50_micros 12.04
rocksdb.build.readrandom_memtable_sst.p75_micros 15.78
rocksdb.build.readrandom_memtable_sst.p99_micros 32.49
rocksdb.build.readrandom_fillunique_random.qps 487856
rocksdb.build.readrandom_fillunique_random.p50_micros 29.65
rocksdb.build.readrandom_fillunique_random.p75_micros 40.93
rocksdb.build.readrandom_fillunique_random.p99_micros 78.68
rocksdb.build.memtablefillrandom.qps 91304
rocksdb.build.memtablefillrandom.p50_micros 171.05
rocksdb.build.memtablefillrandom.p75_micros 196.12
rocksdb.build.memtablefillrandom.p99_micros 291.73
rocksdb.build.memtablereadrandom.qps 1340411
rocksdb.build.memtablereadrandom.p50_micros 9.48
rocksdb.build.memtablereadrandom.p75_micros 13.95
rocksdb.build.memtablereadrandom.p99_micros 30.36
rocksdb.build.readwhilewriting.qps 491004
rocksdb.build.readwhilewriting.p50_micros 29.58
rocksdb.build.readwhilewriting.p75_micros 40.34
rocksdb.build.readwhilewriting.p99_micros 76.78

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15573
2014-01-29 15:08:41 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Siying Dong
b20486f294 [Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15393
2014-01-27 10:26:14 -08:00
Igor Canadi
5356b2a680 Merge branch 'master' into columnfamilies 2014-01-24 18:34:48 -08:00
Siying Dong
8477255da3 Moving Some includes from options.h to forward declaration
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.

Test Plan: make all check

Reviewers: kailiu, haobo, igor, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15411
2014-01-24 17:16:22 -08:00
Igor Canadi
1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi
b13bdfa500 Add a call DisownData() to Cache, which should speed up shutdown
Summary: On a shutdown, freeing memory takes a long time. If we're shutting down, we don't really care about memory leaks. I added a call to Cache that will avoid freeing all objects in cache.

Test Plan:
I created a script to test the speedup and demonstrate how to use the call: https://phabricator.fb.com/P3864368

Clean shutdown took 7.2 seconds, while fast and dirty one took 6.3 seconds. Unfortunately, the speedup is not that big, but should be bigger with bigger block_cache. I have set up the capacity to 80GB, but the script filled up only ~7GB.

Reviewers: dhruba, haobo, MarkCallaghan, xjin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15069
2014-01-24 14:57:52 -08:00
Igor Canadi
677fee27c6 Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.

Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.

Test Plan: reduce_levels_test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15303
2014-01-24 14:57:04 -08:00
Igor Canadi
28d1a0c6f5 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.h
	db/db_test.cc
	include/rocksdb/db.h
	include/utilities/stackable_db.h
2014-01-24 09:27:29 -08:00
kailiu
eda924a03a Remove an unused GetLengthPrefixedSlice
Summary: We have 3 versions of GetLengthPrefixedSlice() and one of them is no longer in use.

Test Plan: make

Reviewers: sdong, igor, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15399
2014-01-23 23:06:52 -08:00
Kai Liu
054c5dda8c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
	util/statistics_imp.h
2014-01-23 16:32:49 -08:00
Igor Canadi
4e91f27c3a Fix performance regression in statistics
Summary:
For some reason, D15099 caused a big performance regression: https://fburl.com/16059000

After digging a bit, I figured out that the reason was that std::atomic_uint_fast64_t was allocated in an array. When I switched from an array to vector, the QPS returned to the previous level. I'm not sure why this is happening, but this diff seems to fix the performance regression.

Test Plan: I ran the regression script, observed the performance going back to normal

Reviewers: tnovak, kailiu, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15375
2014-01-23 16:11:55 -08:00
Kai Liu
bb19b530ca Aggressively inlining the short functions in coding.cc
Summary:
This diff takes an even more aggressive way to inline the functions. A decent rule that I followed is "not inline a function if it is more than 10 lines long."

Normally optimizing code by inline is ugly and hard to control, but since one of our usecase has significant amount of CPU used in functions from coding.cc, I'd like to try this diff out.

Test Plan:
1. the size for some .o file increased a little bit, but most less than 1%. So I think the negative impact of inline is negligible.
2. As the regression test shows (ran for 10 times and I calculated the average number)

    Metrics                                         Befor    After
    ========================================================================
    rocksdb.build.fillseq.qps                       426595   444515    (+4.6%)
    rocksdb.build.memtablefillrandom.qps            121739   123110
    rocksdb.build.memtablereadrandom.qps            1285103  1280520
    rocksdb.build.overwrite.qps                     125816   135570    (+9%)
    rocksdb.build.readrandom_fillunique_random.qps  285995   296863
    rocksdb.build.readrandom_memtable_sst.qps       1027132  1027279
    rocksdb.build.readrandom.qps                    1041427  1054665
    rocksdb.build.readrandom_smallblockcache.qps    1028631  1038433
    rocksdb.build.readwhilewriting.qps              918352   914629

Reviewers: haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15291
2014-01-23 16:03:34 -08:00
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
92a022ad07 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
	db/version_set.cc
2014-01-22 10:59:07 -08:00
kailiu
4036d58dc9 Fix a Statistics-related unit test faulure
Summary:
In my MacOS, the member variables are populated with random numbers after initialization.
This diff fixes it by fill these arrays with 0.

Test Plan: make && ./table_test

Reviewers: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15315
2014-01-21 18:02:55 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Kai Liu
ef602f6275 Misc cleanup on performance branch
Summary:

Did some trivial stuffs:

* Add more comments;
* fix compiler's warning messages (uninitialized variables).
* etc

Test Plan:

make check
2014-01-17 14:26:29 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
kailiu
1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
kailiu
eae1804f29 Remove the unnecessary use of shared_ptr
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213
2014-01-15 18:22:01 -08:00
Igor Canadi
65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
Kai Liu
cd535c2280 Optimize MayContainHash()
Summary:
In latest leaf's, MayContainHash() consistently consumes 5%~7% CPU usage.

I checked the code and did an experiment with/without inlining this method.

In release mode, with `1024 * 1024 * 256` bits and `1024 * 512` entries, both call 2^30 MayContainHash() with distinctive parameters.

As the result showed, this patch reduced the running time from 9.127 sec to 7.891 sec.

Test Plan: make check

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15177
2014-01-14 22:03:57 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Naman Gupta
8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
Igor Canadi
00065d0d5d Fix merge test
Copy max_successive_merges from options to column family options
2014-01-13 13:27:47 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Schalk-Willem Kruger
a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Siying Dong
5b5ab0c1a8 [Performance Branch] Fix memory leak in HashLinkListRep.GetIterator()
Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it.

Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15093
2014-01-10 12:12:28 -08:00
Yancey
afdd2d1a46 fix compile warning 2014-01-10 17:56:35 +08:00
Siying Dong
237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
Siying Dong
424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00