Commit Graph

311 Commits

Author SHA1 Message Date
Naman Gupta
8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function that is applied to the previous value and computes the new value. The new value is written in place if the previous value exists in the memtable and the new value is smaller than the previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
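
The in-place rule described in commit 8454cfe569 can be sketched roughly as follows. This is a minimal illustration, not the RocksDB implementation: `ToyMemTable`, `UpdateCallback` and `PutWithCallback` are hypothetical names, a `std::map` stands in for the memtable, and "smaller" is assumed to mean a shorter byte length.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical callback type: receives the previous value (nullptr if the key
// is absent) and returns the new value.
using UpdateCallback = std::function<std::string(const std::string* prev)>;

// Toy stand-in for a memtable: each key keeps its versions, newest last.
class ToyMemTable {
 public:
  // Returns true if the update was applied in place, false if a new
  // version was appended ("added normally").
  bool PutWithCallback(const std::string& key, const UpdateCallback& cb) {
    auto it = table_.find(key);
    const std::string* prev =
        (it == table_.end()) ? nullptr : &it->second.back();
    std::string next = cb(prev);
    // In place only when a previous value exists and the new one is smaller
    // (assumption: compared by length).
    if (prev != nullptr && next.size() < prev->size()) {
      it->second.back() = next;
      return true;
    }
    table_[key].push_back(next);
    return false;
  }

  // Latest value for the key, or an empty string if the key is absent.
  std::string Get(const std::string& key) const {
    auto it = table_.find(key);
    return it == table_.end() ? std::string() : it->second.back();
  }

  std::size_t VersionCount(const std::string& key) const {
    auto it = table_.find(key);
    return it == table_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string>> table_;
};
```

In this toy, a callback that shrinks the value overwrites the newest version, while one that grows it appends a new version.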
Siying Dong
5b5ab0c1a8 [Performance Branch] Fix memory leak in HashLinkListRep.GetIterator()
Summary: The full list constructed for a full iterator could be leaked. This bug was introduced when I copied the full iterator code from hash skip list to hash link list. This patch fixes it.

Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15093
2014-01-10 12:12:28 -08:00
Siying Dong
237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics are not enabled, a StopWatch created only for stats still gets the time of day, which is wasteful. This patch adds a new option to StopWatch to skip that call in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
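
The idea in commit 237a3da677 can be illustrated with a small sketch. It assumes a hypothetical `ToyStopWatch` whose constructor takes an `enabled` flag; the real StopWatch option and its name are not shown in this log, and `ClockReads()` exists only so the example can count clock accesses.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stopwatch that never touches the clock when disabled.
class ToyStopWatch {
 public:
  explicit ToyStopWatch(bool enabled)
      : enabled_(enabled), start_(enabled ? NowMicros() : 0) {}

  // Returns 0 without reading the clock when the watch is disabled.
  std::uint64_t ElapsedMicros() const {
    return enabled_ ? NowMicros() - start_ : 0;
  }

  // Number of clock reads so far (illustration only).
  static int& ClockReads() {
    static int reads = 0;
    return reads;
  }

 private:
  static std::uint64_t NowMicros() {
    ++ClockReads();
    auto since_epoch = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(since_epoch)
            .count());
  }

  bool enabled_;
  std::uint64_t start_;
};
```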
Siying Dong
424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case where prefix hashing is the primary way of seeking to an entry. Compared to the hash skip list implementation, this implementation is more memory efficient, but search inside each bucket is always linear. The target scenario is one where each hash bucket holds only a very limited number of records.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00
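
The structure described in commit 424a524ac9 (keys hashed by prefix, a sorted linked list per bucket, linear search inside a bucket) might look like the sketch below. `ToyHashLinkList` is a hypothetical name, the prefix is assumed to be a fixed number of leading bytes, and the thread-safety guarantees of the real memtable are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy prefix-hashed table: one sorted singly linked list per bucket.
class ToyHashLinkList {
 public:
  ToyHashLinkList(std::size_t bucket_count, std::size_t prefix_len)
      : buckets_(bucket_count), prefix_len_(prefix_len) {}

  void Insert(const std::string& key) {
    std::unique_ptr<Node>* link = &buckets_[BucketOf(key)];
    // Linear walk to the first node whose key is not smaller than `key`.
    while (*link && (*link)->key < key) link = &(*link)->next;
    std::unique_ptr<Node> node(new Node);
    node->key = key;
    node->next = std::move(*link);
    *link = std::move(node);
  }

  bool Contains(const std::string& key) const {
    // The list is sorted, so the scan can stop at the first larger key.
    for (const Node* n = buckets_[BucketOf(key)].get(); n && n->key <= key;
         n = n->next.get()) {
      if (n->key == key) return true;
    }
    return false;
  }

  // All keys stored in the bucket that `key` hashes to, in sorted order.
  std::vector<std::string> BucketKeys(const std::string& key) const {
    std::vector<std::string> out;
    for (const Node* n = buckets_[BucketOf(key)].get(); n; n = n->next.get()) {
      out.push_back(n->key);
    }
    return out;
  }

 private:
  struct Node {
    std::string key;
    std::unique_ptr<Node> next;
  };

  std::size_t BucketOf(const std::string& key) const {
    return std::hash<std::string>()(key.substr(0, prefix_len_)) %
           buckets_.size();
  }

  std::vector<std::unique_ptr<Node>> buckets_;
  std::size_t prefix_len_;
};
```

Each node carries only a key and one pointer, which is where the memory saving over a per-bucket skip list would come from; the cost is the linear scan.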
Igor Canadi
17a222670b Merge branch 'master' into performance 2014-01-07 11:04:21 -08:00
kailiu
7e70ff63d6 Fix issue #57 2014-01-06 11:11:19 -08:00
Kai Liu
774ed89c24 Replace vector with autovector
Summary: This diff only replaces vector in the cases where we frequently create a vector with a small number of entries. It doesn't aim to improve the performance of a specific area; it is more of a small-scale test of autovector to see how it works in real life.

Test Plan:
make check

I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.

Reviewers: dhruba, haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14985
2014-01-02 16:43:35 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Siying Dong
18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation is that in the db_bench readrandom test, when none of the keys exist, I found that about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU on freeing them, which is unnecessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
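
A rough sketch of the idea in commit 18df47b79a: allocate message storage only when a message is actually given. `ToyStatus` is a hypothetical class and does not reproduce the layout of the real Status.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy status object that heap-allocates only for non-empty messages.
class ToyStatus {
 public:
  static ToyStatus OK() { return ToyStatus(kOk, std::string()); }
  static ToyStatus NotFound() { return ToyStatus(kNotFound, std::string()); }
  static ToyStatus NotFound(const std::string& msg) {
    return ToyStatus(kNotFound, msg);
  }

  bool IsNotFound() const { return code_ == kNotFound; }
  // False when no message was stored, i.e. no allocation was made for it.
  bool HasMessage() const { return msg_ != nullptr; }

  std::string ToString() const {
    std::string out = (code_ == kOk) ? "OK" : "NotFound";
    if (msg_) out += ": " + *msg_;
    return out;
  }

 private:
  enum Code { kOk, kNotFound };

  ToyStatus(Code code, const std::string& msg)
      : code_(code), msg_(msg.empty() ? nullptr : new std::string(msg)) {}

  Code code_;
  std::unique_ptr<std::string> msg_;  // stays null for empty messages
};
```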
Kai Liu
b40c052bfa Fix all the comparison issue in fb dev servers 2013-12-26 16:13:49 -08:00
kailiu
113a08c929 Fix [-Werror=sign-compare] in autovector_test 2013-12-26 15:47:07 -08:00
kailiu
c01676e46d Implement autovector
Summary:
A vector that leverages a pre-allocated stack-based array to achieve better
performance for arrays with a small number of items.

Test Plan:
Added tests for both correctness and performance

Here is the performance benchmark between vector and autovector

Please note that in the test "Creation and Insertion Test", the test cases were designed with the motivation described below:

* no element inserted: internal array of std::vector may not really get
  initialized.
* one element inserted: internal array of std::vector must have been
  initialized.
* kSize elements inserted. This shows the most time we'll spend if we
  keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
  autovector must have been initialized.

Note: kSize is the capacity of autovector

  =====================================================
  Creation and Insertion Test
  =====================================================
  created 100000 vectors:
  	each was inserted with 0 elements
  	total time elapsed: 128000 (ns)
  created 100000 autovectors:
  	each was inserted with 0 elements
  	total time elapsed: 3641000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 0 elements
  	total time elapsed: 9896000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 1 elements
  	total time elapsed: 11089000 (ns)
  created 100000 autovectors:
  	each was inserted with 1 elements
  	total time elapsed: 5008000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 1 elements
  	total time elapsed: 24271000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 4 elements
  	total time elapsed: 39369000 (ns)
  created 100000 autovectors:
  	each was inserted with 4 elements
  	total time elapsed: 10121000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 4 elements
  	total time elapsed: 28473000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 8 elements
  	total time elapsed: 75013000 (ns)
  created 100000 autovectors:
  	each was inserted with 8 elements
  	total time elapsed: 18237000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 8 elements
  	total time elapsed: 42464000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 16 elements
  	total time elapsed: 102319000 (ns)
  created 100000 autovectors:
  	each was inserted with 16 elements
  	total time elapsed: 76724000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 16 elements
  	total time elapsed: 68285000 (ns)
  -----------------------------------
  =====================================================
  Sequence Access Test
  =====================================================
  performed 100000 sequence access against vector
  	size: 4
  	total time elapsed: 198000 (ns)
  performed 100000 sequence access against autovector
  	size: 4
  	total time elapsed: 306000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 8
  	total time elapsed: 565000 (ns)
  performed 100000 sequence access against autovector
  	size: 8
  	total time elapsed: 512000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 16
  	total time elapsed: 1076000 (ns)
  performed 100000 sequence access against autovector
  	size: 16
  	total time elapsed: 1070000 (ns)
  -----------------------------------

Reviewers: dhruba, haobo, sdong, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14655
2013-12-26 15:03:47 -08:00
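
The autovector idea from commit c01676e46d (a fixed inline array for the first kSize items, with a heap-backed vector only for overflow) can be sketched as below. `ToyAutoVector` is a hypothetical, heavily reduced version: it requires a default-constructible element type and offers only `push_back`, `size` and read-only indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy small-vector: the first kSize elements live inside the object itself.
template <typename T, std::size_t kSize = 8>
class ToyAutoVector {
 public:
  void push_back(const T& v) {
    if (inline_count_ < kSize) {
      inline_items_[inline_count_++] = v;  // no heap allocation
    } else {
      overflow_.push_back(v);  // falls back to std::vector
    }
  }

  std::size_t size() const { return inline_count_ + overflow_.size(); }

  const T& operator[](std::size_t i) const {
    assert(i < size());
    return i < kSize ? inline_items_[i] : overflow_[i - kSize];
  }

  // True while every element still fits in the inline array.
  bool OnlyInStack() const { return overflow_.empty(); }

 private:
  std::size_t inline_count_ = 0;
  T inline_items_[kSize];
  std::vector<T> overflow_;
};
```

This matches the benchmark scenarios above: up to kSize insertions stay in the inline array, and the 2 * kSize case forces the internal vector to be used.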
Kai Liu
5643ae1a3f Merge pull request #32 from jamesgolick/master
Only try to use fallocate if it's actually present on the system.
2013-12-26 14:20:22 -08:00
Siying Dong
abaf26266d [RocksDB] [Performance Branch] Some Changes to PlainTable format
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test

Test Plan: test db_test

Reviewers: haobo, kailiu

CC: dhruba, igor, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D14457
2013-12-20 12:08:35 -08:00
kailiu
5f5e5fc2e9 Revert atomic_size_t usage
Summary:
By disassembling the function, we found that the atomic variables do invoke the `lock` instruction prefix, which locks the memory bus.
As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t.

Test Plan: passed `cache_test`

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14667
2013-12-13 17:03:19 -08:00
Haobo Xu
5090316f0d [RocksDB] [Performance Branch] Trivia build fix
Summary: make release complains about a signed/unsigned comparison.

Test Plan: make release

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14661
2013-12-13 14:21:59 -08:00
kailiu
b660e2d468 Expose usage info for the cache
Summary: This diff will help us to figure out the memory usage for the cache part.

Test Plan: added a new memory usage test for cache

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14559
2013-12-13 12:53:45 -08:00
James Golick
c28dd2a891 oops - missed a spot 2013-12-11 11:18:00 -08:00
Igor Canadi
e8d40c31b3 [RocksDB perf] Cache speedup
Summary:
I ran a get benchmark where all the data is in the cache and observed that most of the time is spent waiting for the lock in LRUCache.

This is an effort to optimize LRUCache.

Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:

    /db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8

I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14571
2013-12-11 08:33:29 -08:00
Haobo Xu
3c02c363b3 [RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering
Summary: as title

Test Plan: dynamic_bloom_test

Reviewers: dhruba, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14385
2013-12-11 00:15:14 -08:00
James Golick
43c386b72e only try to use fallocate if it's actually present on the system 2013-12-10 22:34:19 -08:00
kailiu
c79e595471 Make Cache::GetCapacity constant
Summary: This allows calling it via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()`, since GetOptions() is also a const method.
2013-12-10 17:34:35 -08:00
Igor Canadi
19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00
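
One way to picture the change in commit 19f5463d3f: foreground calls only buffer the log line and mark a flush as pending, and a background caller performs the actual flush. The sketch below is single-threaded and hypothetical (`ToyLogger`, `sink_`); a real logger would need proper synchronization around the `flush_pending_` flag.

```cpp
#include <cassert>
#include <string>

// Toy logger that defers flushing to a background caller.
class ToyLogger {
 public:
  // Foreground path: buffer the line and record that a flush is pending.
  void Log(const std::string& line) {
    buffer_ += line;
    buffer_ += '\n';
    flush_pending_ = true;
  }

  // Background path: flush only if something is pending.
  // Returns true if any work was done.
  bool Flush() {
    if (!flush_pending_) return false;
    sink_ += buffer_;
    buffer_.clear();
    flush_pending_ = false;
    return true;
  }

  const std::string& sink() const { return sink_; }

 private:
  std::string buffer_;
  std::string sink_;  // stands in for the log file
  bool flush_pending_ = false;
};
```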
Igor Canadi
fb9fce4fc3 [RocksDB] BackupableDB
Summary:
In this diff I present BackupableDB v1. You can easily use it to back up your DB, and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It inherits the StackableDB interface, so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), the current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backups every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily back up to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep the last N snapshots around if you want, for some reason, to be able to restore from an earlier snapshot. All backups are done incrementally - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on the assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you back up, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at the time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

   backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
               0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
               files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to back it up from the DB, it will just reference the same file
               files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediately when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems like the right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295
2013-12-09 14:06:52 -08:00
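
The incremental, ref-counted file sharing described in commit fb9fce4fc3 can be sketched as follows. `ToyBackupStore` and its methods are hypothetical and only track file names; no files are copied, and none of the real BackupableDB API is reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of shared, ref-counted backup files.
class ToyBackupStore {
 public:
  // Records a snapshot and returns how many files would need copying,
  // i.e. those not already referenced by an earlier snapshot.
  std::size_t CreateSnapshot(int id, const std::vector<std::string>& files) {
    assert(snapshots_.count(id) == 0);  // assumption: ids are never reused
    std::size_t copied = 0;
    for (const std::string& f : files) {
      if (refs_[f]++ == 0) ++copied;  // first reference: copy the file
    }
    snapshots_[id] = files;
    return copied;
  }

  // Drops a snapshot; files whose ref count reaches zero are deleted.
  void DeleteSnapshot(int id) {
    auto it = snapshots_.find(id);
    if (it == snapshots_.end()) return;
    for (const std::string& f : it->second) {
      auto ref = refs_.find(f);
      if (--ref->second == 0) refs_.erase(ref);
    }
    snapshots_.erase(it);
  }

  bool HasFile(const std::string& f) const { return refs_.count(f) != 0; }

 private:
  std::map<std::string, int> refs_;
  std::map<int, std::vector<std::string>> snapshots_;
};
```

As in the commit message, this relies on the assumption that two files with the same name from the same DB are always identical.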
Igor Canadi
9644e0e0c7 Print stack trace on assertion failure
Summary:
This will help me a lot! When we hit an assertion in unittest, we get the whole stack trace now.

Also, changed stack trace a bit, we now include actual demangled C++ class::function symbols!

Test Plan: Added ASSERT_TRUE(false) to a test, observed a stack trace

Reviewers: haobo, dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14499
2013-12-06 17:11:09 -08:00
kailiu
c7707f24c2 Refine the statistics 2013-12-06 16:51:35 -08:00
kailiu
551e9428ce Merge branch 'master' into performance 2013-12-06 14:15:42 -08:00
kailiu
e1d92dfd2e Fix a bunch of mac compilation issues in performance branch 2013-12-04 23:00:33 -08:00
Vamsi Ponnekanti
fa88cbc71e [Log dumper broken when merge operator is in log]
Summary: $title

Test Plan:
on my dev box

Revert Plan: OK

Task ID: #

Reviewers: emayanke, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14451
2013-12-04 16:22:54 -08:00
Mark Callaghan
97aa401e2f Add compression options to db_bench
Summary:
This adds 2 options for compression to db_bench:
* universal_compression_size_percent
* compression_level - to set zlib compression level
It also logs compression_size_percent at startup in LOG

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14439
2013-12-03 14:28:48 -08:00
Igor Canadi
eb12e47e0e Killing Transform Rep
Summary:
Let's get rid of TransformRep and its children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.

This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.

I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14397
2013-12-03 12:42:15 -08:00
lovro
930cb0b9ee Clarify CompactionFilter thread safety requirements
Summary: Documenting our discussion

Test Plan: make

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: igor

Differential Revision: https://reviews.facebook.net/D14403
2013-12-02 16:41:43 -08:00
lovro
45a2f2d8d3 Fix build without glibc
Summary: The preprocessor does not follow the normal rules of && evaluation and tries to evaluate __GLIBC_PREREQ(2, 12) even though the defined() check fails.  This breaks the build if __GLIBC_PREREQ is absent.

Test Plan: Try adding #undef __GLIBC_PREREQ above the offending line, build no longer breaks

Reviewed By: igor

Blame Rev: 4c81383628
2013-12-01 11:32:54 -08:00
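
The problem in commit 45a2f2d8d3 arises because the whole `#if` expression must parse after macro replacement: when `__GLIBC_PREREQ` is undefined, `__GLIBC_PREREQ(2, 12)` becomes `0(2, 12)`, which is a syntax error regardless of the `defined()` check. A common workaround, sketched here, is to nest the checks so the inner expression is skipped entirely. `TOY_HAVE_GLIBC_2_12` and `HaveGlibc212` are hypothetical names, and this may differ from the actual patch.

```cpp
#include <cassert>

// Fragile form, left as a comment because it fails to compile when the
// macro is absent:
//   #if defined(__GLIBC_PREREQ) && __GLIBC_PREREQ(2, 12)

// Nested form: the inner #if is inside a skipped group when the macro is
// undefined, so its expression is never evaluated.
#if defined(__GLIBC_PREREQ)
#if __GLIBC_PREREQ(2, 12)
#define TOY_HAVE_GLIBC_2_12 1
#endif
#endif

#ifndef TOY_HAVE_GLIBC_2_12
#define TOY_HAVE_GLIBC_2_12 0
#endif

// Reports whether the build detected glibc 2.12 or newer.
inline bool HaveGlibc212() { return TOY_HAVE_GLIBC_2_12 != 0; }
```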
Kai Liu
1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
lovro
4c81383628 Set background thread name with pthread_setname_np()
Summary: Makes it easier to monitor performance with top

Test Plan: ./manual_compaction_test with `top -H` running.  Previously was two `manual_compacti`, now one shows `rocksdb:bg0`.

Reviewers: igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14367
2013-11-27 11:28:06 -08:00
Haobo Xu
4e6463ea44 [RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.

Test Plan: make check and test it on leaf4

Reviewers: dhruba, sdong, kailiu

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14307
2013-11-26 21:59:36 -08:00
Siying Dong
8aac46d686 [RocksDB Performance Branch] Fix a regression bug of munmap
Summary:
Fix a stupid bug I just introduced in b59d4d5a50, which I didn't even mean to include.
GCC might remove the munmap.

Test Plan: Run it and make sure munmap succeeds

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: dhruba, reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14361
2013-11-26 14:05:37 -08:00
Haobo Xu
5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: Liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass a shared pointer again among internal functions. A raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
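
The ownership pattern from commit 5b825d6964 can be sketched as follows: the options object holds the `std::shared_ptr`, and internal helpers receive a non-owning raw pointer. All names here (`ToyStatistics`, `ToyOptions`, `ToyDB`, `RecordTick`) are hypothetical stand-ins for illustration.

```cpp
#include <cassert>
#include <memory>

struct ToyStatistics {
  long tickers = 0;
};

struct ToyOptions {
  std::shared_ptr<ToyStatistics> statistics;  // the owning reference
};

// Internal helper: takes a raw, non-owning pointer, so no reference count
// is touched on each call.
inline void RecordTick(ToyStatistics* stats, long count = 1) {
  if (stats != nullptr) stats->tickers += count;
}

class ToyDB {
 public:
  explicit ToyDB(const ToyOptions& options) : options_(options) {}

  void Get() { RecordTick(options_.statistics.get()); }

 private:
  // Holding the options keeps the statistics object alive for as long as
  // the DB exists, which is what makes the raw pointer safe.
  ToyOptions options_;
};
```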
Siying Dong
3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong
b135d01e7b Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottlenecks. Currently, in Get(), Put() and iterators, 2-3 simple timings are used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in several existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001

Conflicts:
	table/merger.cc
2013-11-21 17:39:19 -08:00
Siying Dong
3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottlenecks. Currently, in Get(), Put() and iterators, 2-3 simple timings are used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in several existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
Siying Dong
58e1956d50 [Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables.
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.

In longer term, we might want to revisit the way we abstract mem tables implementations.

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14265
2013-11-20 20:49:23 -08:00
Siying Dong
b59d4d5a50 A Simple Plain Table
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.

Test Plan: Add unit test

Reviewers: haobo, dhruba, kailiu

CC:

Task ID: #

Blame Rev:
2013-11-20 18:44:22 -08:00
kailiu
6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduced a `flush_block_policy_factory` in Options; however, that option is strongly related to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
kailiu
1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing tests. Sample output of the table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
Igor Canadi
f611aba559 Move the compiler back to 4.8.1 + more small fixes
Summary:
1. Moved the compiler back to 4.8.1 and use CentOS 5.2 binaries if the OS is CentOS 5.2.

2. Fixes this issue: https://github.com/facebook/rocksdb/issues/7

3. We use a lot of C++11 features, so we can't pretend we can compile without them. This makes C++11 a first-class dependency.

4. Fix blob_store_test, which fails on Ubuntu with a "too many files opened" error

5. Removed dependency on port/port_chromium.h, which does not even exist on our system

Test Plan: make clean; make check

Reviewers: dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14145
2013-11-18 11:40:16 -08:00
kailiu
97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff involves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Pascal Borreli
443e04e62d Fixed typos 2013-11-16 11:21:34 +00:00
Kai Liu
80bb81c6fe Add the correct table_factory for tables in table_tests 2013-11-12 23:54:31 -08:00
Kai Liu
22e1b04deb Quick fix for a string format
Summary: Fix one more string format issue that throws a warning on Mac
2013-11-12 21:22:32 -08:00