Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.
Test Plan: ran `make check`
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15009
Summary: I'm not sure what's the purpose of encoding file number to a new buffer for looking up the table cache. It seems to be unnecessary to me. With this patch, we point the lookup key to the address of the int64 of the file number.
Test Plan: make all check
Reviewers: dhruba, haobo, igor, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14811
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.
The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14691
Summary:
A vector that leverages pre-allocated stack-based array to achieve better
performance for array with small amount of items.
Test Plan:
Added tests for both correctness and performance
Here is the performance benchmark between vector and autovector
Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below:
* no element inserted: internal array of std::vector may not really get
initialize.
* one element inserted: internal array of std::vector must have
initialized.
* kSize elements inserted. This shows the most time we'll spend if we
keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
autovector must have been initialized.
Note: kSize is the capacity of autovector
=====================================================
Creation and Insertion Test
=====================================================
created 100000 vectors:
each was inserted with 0 elements
total time elapsed: 128000 (ns)
created 100000 autovectors:
each was inserted with 0 elements
total time elapsed: 3641000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 0 elements
total time elapsed: 9896000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 1 elements
total time elapsed: 11089000 (ns)
created 100000 autovectors:
each was inserted with 1 elements
total time elapsed: 5008000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 1 elements
total time elapsed: 24271000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 4 elements
total time elapsed: 39369000 (ns)
created 100000 autovectors:
each was inserted with 4 elements
total time elapsed: 10121000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 4 elements
total time elapsed: 28473000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 8 elements
total time elapsed: 75013000 (ns)
created 100000 autovectors:
each was inserted with 8 elements
total time elapsed: 18237000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 8 elements
total time elapsed: 42464000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 16 elements
total time elapsed: 102319000 (ns)
created 100000 autovectors:
each was inserted with 16 elements
total time elapsed: 76724000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 16 elements
total time elapsed: 68285000 (ns)
-----------------------------------
=====================================================
Sequence Access Test
=====================================================
performed 100000 sequence access against vector
size: 4
total time elapsed: 198000 (ns)
performed 100000 sequence access against autovector
size: 4
total time elapsed: 306000 (ns)
-----------------------------------
performed 100000 sequence access against vector
size: 8
total time elapsed: 565000 (ns)
performed 100000 sequence access against autovector
size: 8
total time elapsed: 512000 (ns)
-----------------------------------
performed 100000 sequence access against vector
size: 16
total time elapsed: 1076000 (ns)
performed 100000 sequence access against autovector
size: 16
total time elapsed: 1070000 (ns)
-----------------------------------
Reviewers: dhruba, haobo, sdong, chip
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14655
Summary: The previous patch is wrong. rep_.resize(kHeader) just resets the header portion to zero, and should not cause a re-allocation if g++ does it right. I will go ahead and revert it.
Test Plan: make check
Reviewers: dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14793
Summary: tmp_batch_ will get re-allocated for every merged write batch because of the existing resize in WriteBatch::Clear. Note that in DBImpl::BuildBatchGroup, we have a hard coded upper limit of batch size 1<<20 = 1MB already.
Test Plan: make check
Reviewers: dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14787
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test
Test Plan: test db_test
Reviewers: haobo, kailiu
CC: dhruba, igor, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D14457
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.
Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:
overwrite 56345 -> 63001
fillseq 193730 -> 185296
readrandom 771301 -> 1219803 (58% improvement!)
readrandom_smallblockcache 677609 -> 862850
readrandom_memtable_sst 710440 -> 1109223
readrandom_fillunique_random 221589 -> 247869
memtablefillrandom 105286 -> 92643
memtablereadrandom 763033 -> 1288862
Test Plan:
make asan_check
I am also running db_stress
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14679
Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14739
Summary:
db_test should be the first to execute because it finds the most bugs.
Also, when third parties report issues, we don't want ldb error message, we prefer to have db_test error message. For example, see thread: https://github.com/facebook/rocksdb/issues/25
Test Plan: make check
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14715
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.
Test Plan: make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14607
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14697
Summary: A bug to fix. IT's already fixed in D14457, but want to check it in sooner to unblock tests
Test Plan: plain_table_db_test
Reviewers: nkg-, haobo
Reviewed By: nkg-
CC: kailiu, leveldb
Differential Revision: https://reviews.facebook.net/D14673
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format
[data block]
[meta block 1: stats block]
[meta block 2: future extended block]
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[index block: we only have the placeholder here, we can add persistent index block in the future]
[Footer: contains magic number, handle to metaindex block and index block]
<end_of_file>
Test Plan: extended existing property block test.
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14523
Summary:
By disassemble the function, we found that the atomic variables do invoke the `lock` that locks the memory bus.
As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t.
Test Plan: passed `cache_test`
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14667
Summary: make release complains signed unsigned comparison.
Test Plan: make release
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14661
Summary: This diff will help us to figure out the memory usage for the cache part.
Test Plan: added a new memory usage test for cache
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14559
Summary:
PlainTable now has a bug of the ordering of indexes for the prefixes in the same bucket. I thought std::map guaranteed key order but it didn't, probably because I didn't use it properly. But seems to me that we don't need to make extra sorting as input prefixes are already sorted. Found by problem by running leaf4 against plain table. Replace the map with a vector. It should performs better too.
After the fix, leaf4 unit tests are passing.
Test Plan:
run plain_table_db_test
Also going to run db_test with plain table in the uncommitted branch.
Reviewers: haobo, kailiu
Reviewed By: haobo
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14649
Summary:
I realized that manifest will get deleted by PurgeObsoleteFiles in DBImpl, but it is sill cleaner to delete
files before we restore the backup
Test Plan: backupable_db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14619
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.
Test Plan:
make all check
will run db_stress for a while too to make sure no problem.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14589
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14595
Summary: @MarkCallaghan's tests indicate that performance with 8k rows in memtable is much worse than empty memtable. I wanted to add a regression tests that measures this effect, so we could optimize it. However, current config shows 634461 QPS on my devbox. Mark, any idea why this is so much faster than your measurements?
Test Plan: Ran the regression test.
Reviewers: MarkCallaghan, dhruba, haobo
Reviewed By: MarkCallaghan
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D14511
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Conflicts:
db/db_impl.cc
db/memtable.cc
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Conflicts:
db/db_impl.cc
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.
This is an effort to optimize LRUCache.
Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:
/db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8
I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14571
Summary: as title
Test Plan: dynamic_bloom_test
Reviewers: dhruba, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14385
Summary: We now delete backups with newer sequence number, so the clients don't have to handle confusing situations when they restore from backup.
Test Plan: added a unit test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14547