Commit Graph

943 Commits

Author SHA1 Message Date
Igor Canadi
ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
Igor Canadi
7535443083 [RocksDB] Support for column families in manifest
Summary:
<This diff is for Column Family branch>

Added fields in manifest file to support adding and deleting column families.

Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14733
2014-01-02 04:18:28 -08:00
Igor Canadi
6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
Igor Canadi
b60c14f6ee Support multi-threaded DisableFileDeletions() and EnableFileDeletions()
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().

However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.

Test Plan: make check

Reviewers: dhruba, haobo, sanketh

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14781
2014-01-02 03:33:42 -08:00
Igor Canadi
345fb94d26 moving autovector_test after db_test 2014-01-02 03:30:29 -08:00
Igor Canadi
52ea1be90a Add -DROCKSDB_FALLOCATE_PRESENT to fbcode build 2014-01-02 02:00:04 -08:00
Igor Canadi
2b3aab3ee6 Merge pull request #48 from dyu/master
fix build bug
2014-01-02 00:27:18 -08:00
Kai Liu
fe030bd1ca update the latest version in README.fb to 2.7 2013-12-30 16:16:24 -08:00
Kai Liu
5a20744a6a Simplify build_tools/build_detect_version 2013-12-30 16:14:55 -08:00
Kai Liu
1795397bf0 Update README.fb
Update the latest version number.
2013-12-30 14:53:56 -08:00
dyu
e842b99fc5 docs for shared library builds 2013-12-30 21:34:45 +08:00
dyu
a6b476a2ac tweak build bug fix 2013-12-30 21:33:52 +08:00
dyu
9d4dc0da27 fix build bug from recent commit:43c386b72e 2013-12-27 15:19:31 +08:00
Siying Dong
a094f3b3b5 TableCache.FindTable() to avoid the mem copy of file number
Summary: I'm not sure what's the purpose of encoding file number to a new buffer for looking up the table cache. It seems to be unnecessary to me. With this patch, we point the lookup key to the address of the int64 of the file number.

Test Plan: make all check

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14811
2013-12-26 16:57:07 -08:00
Siying Dong
18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
Kai Liu
b40c052bfa Fix all the comparison issue in fb dev servers 2013-12-26 16:13:49 -08:00
kailiu
113a08c929 Fix [-Werror=sign-compare] in autovector_test 2013-12-26 15:47:07 -08:00
kailiu
079a21ba99 Fix the unused variable warning message in mac os 2013-12-26 15:12:30 -08:00
kailiu
c01676e46d Implement autovector
Summary:
A vector that leverages pre-allocated stack-based array to achieve better
performance for array with small amount of items.

Test Plan:
Added tests for both correctness and performance

Here is the performance benchmark between vector and autovector

Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below:

* no element inserted: internal array of std::vector may not really get
  initialize.
* one element inserted: internal array of std::vector must have
  initialized.
* kSize elements inserted. This shows the most time we'll spend if we
  keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
  autovector must have been initialized.

Note: kSize is the capacity of autovector

  =====================================================
  Creation and Insertion Test
  =====================================================
  created 100000 vectors:
  	each was inserted with 0 elements
  	total time elapsed: 128000 (ns)
  created 100000 autovectors:
  	each was inserted with 0 elements
  	total time elapsed: 3641000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 0 elements
  	total time elapsed: 9896000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 1 elements
  	total time elapsed: 11089000 (ns)
  created 100000 autovectors:
  	each was inserted with 1 elements
  	total time elapsed: 5008000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 1 elements
  	total time elapsed: 24271000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 4 elements
  	total time elapsed: 39369000 (ns)
  created 100000 autovectors:
  	each was inserted with 4 elements
  	total time elapsed: 10121000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 4 elements
  	total time elapsed: 28473000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 8 elements
  	total time elapsed: 75013000 (ns)
  created 100000 autovectors:
  	each was inserted with 8 elements
  	total time elapsed: 18237000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 8 elements
  	total time elapsed: 42464000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 16 elements
  	total time elapsed: 102319000 (ns)
  created 100000 autovectors:
  	each was inserted with 16 elements
  	total time elapsed: 76724000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 16 elements
  	total time elapsed: 68285000 (ns)
  -----------------------------------
  =====================================================
  Sequence Access Test
  =====================================================
  performed 100000 sequence access against vector
  	size: 4
  	total time elapsed: 198000 (ns)
  performed 100000 sequence access against autovector
  	size: 4
  	total time elapsed: 306000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 8
  	total time elapsed: 565000 (ns)
  performed 100000 sequence access against autovector
  	size: 8
  	total time elapsed: 512000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 16
  	total time elapsed: 1076000 (ns)
  performed 100000 sequence access against autovector
  	size: 16
  	total time elapsed: 1070000 (ns)
  -----------------------------------

Reviewers: dhruba, haobo, sdong, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14655
2013-12-26 15:03:47 -08:00
Kai Liu
5643ae1a3f Merge pull request #32 from jamesgolick/master
Only try to use fallocate if it's actually present on the system.
2013-12-26 14:20:22 -08:00
Dhruba Borthakur
71ddb117c8 Add a pointer to the engineering design discussion forum.
Summary:
Add a pointer to the engineering design discussion forum.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-12-23 12:19:18 -08:00
Igor Canadi
b26dc95628 Initialize sequence number in BatchResult - issue #39 2013-12-20 10:01:12 -08:00
Igor Canadi
1fdb3f7dc6 [RocksDB] Optimize locking for Get
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679
2013-12-20 09:57:58 -08:00
Igor Canadi
540a2894b6 Merge pull request #28 from bartman/master
fix missing gflags library
2013-12-19 17:46:11 -08:00
Mark Callaghan
ca92068b12 Add 'readtocache' test
Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
  db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14739
2013-12-18 16:54:53 -08:00
Igor Canadi
269709a885 Merge branch 'master' into columnfamilies 2013-12-18 15:56:37 -08:00
Igor Canadi
e914b6490d Reorder tests
Summary:
db_test should be the first to execute because it finds the most bugs.

Also, when third parties report issues, we don't want ldb error message, we prefer to have db_test error message. For example, see thread: https://github.com/facebook/rocksdb/issues/25

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14715
2013-12-18 13:37:06 -08:00
Igor Canadi
cbb8da6fdb Merge pull request #35 from zizkovrb/rm-ds_store
Remove utilities/.DS_Store file.
2013-12-18 13:15:03 -08:00
Igor Canadi
3b50b6213d Merge pull request #37 from mlin/more-c-bindings
C bindings: add a bunch of the newer options
2013-12-18 13:12:04 -08:00
Igor Canadi
9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
Siying Dong
14995a8ff3 Move level0 sorting logic from Version::SaveTo() to Version::Finalize()
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.

Test Plan: make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14607
2013-12-17 18:06:58 -08:00
Siying Dong
a8b8b11dc4 Get() Does Not Reserve space for to_delete memtables
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14697
2013-12-17 17:16:16 -08:00
Josef Šimánek
8c34189f0c Remove .DS_Store files. 2013-12-17 06:00:47 +01:00
Mike Lin
2a2506b629 C bindings: add a bunch of the newer options 2013-12-15 13:47:06 -08:00
Igor Canadi
417b453fa6 [backupable db] Delete db_dir children when restoring backup
Summary:
I realized that manifest will get deleted by PurgeObsoleteFiles in DBImpl, but it is sill cleaner to delete
files before we restore the backup

Test Plan: backupable_db_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14619
2013-12-12 14:57:18 -08:00
Mark Callaghan
e9e6b00d29 Add monitoring for universal compaction and add counters for compaction IO
Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583
2013-12-12 13:27:43 -08:00
Igor Canadi
249e736bc5 portable %lu printing 2013-12-12 08:13:47 -08:00
Igor Canadi
f5f5c645a8 Add readrandom with both memtable and sst regression test
Summary: @MarkCallaghan's tests indicate that performance with 8k rows in memtable is much worse than empty memtable. I wanted to add a regression tests that measures this effect, so we could optimize it. However, current config shows 634461 QPS on my devbox. Mark, any idea why this is so much faster than your measurements?

Test Plan: Ran the regression test.

Reviewers: MarkCallaghan, dhruba, haobo

Reviewed By: MarkCallaghan

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D14511
2013-12-11 13:51:20 -08:00
Siying Dong
a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
James Golick
c28dd2a891 oops - missed a spot 2013-12-11 11:18:00 -08:00
Siying Dong
bc5dd19b14 [RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409
2013-12-11 10:50:09 -08:00
Siying Dong
0304e3d2ff When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577

Conflicts:
	db/db_impl.cc
2013-12-11 10:02:17 -08:00
Igor Canadi
e8d40c31b3 [RocksDB perf] Cache speedup
Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.

This is an effort to optimize LRUCache.

Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:

    /db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8

I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14571
2013-12-11 08:33:29 -08:00
James Golick
43c386b72e only try to use fallocate if it's actually present on the system 2013-12-10 22:34:19 -08:00
Igor Canadi
5e4ab767cf BackupableDB delete backups with newer seq number
Summary: We now delete backups with newer sequence number, so the clients don't have to handle confusing situations when they restore from backup.

Test Plan: added a unit test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14547
2013-12-10 20:49:28 -08:00
Igor Canadi
204bb9cffd Get rid of LogFlush() in InternalIterator 2013-12-10 10:59:00 -08:00
Igor Canadi
19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00
Siying Dong
4815468be4 Fix another sign and unsign comparison in test 2013-12-10 10:52:47 -08:00
Igor Canadi
cbe7ffef9a fix comparison between signed and unsigned 2013-12-10 10:48:49 -08:00
Igor Canadi
7cf5728440 Cleaning up BackupableDB + fix valgrind errors
Summary: Valgrind complained about BackupableDB. This fixes valgrind errors. Also, I cleaned up some code.

Test Plan: valgrind does not complain anymore

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14529
2013-12-10 10:35:06 -08:00