Commit Graph

846 Commits

Author SHA1 Message Date
Igor Canadi
6aef661230 some improvements to CompressedCache test 2014-02-14 17:47:53 -08:00
Igor Canadi
422bb09cb0 Fix table properties
Summary: Adapt table properties to column family world

Test Plan: make check

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16161
2014-02-14 17:13:10 -08:00
Igor Canadi
76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
Igor Canadi
be7e273d83 fix u/s comparison #83 2014-02-14 16:18:55 -08:00
Igor Canadi
c67d48c852 [CF] DB test to run on non-default column family
Summary:
This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far.

One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families.

Test Plan: This is a test!

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16149
2014-02-14 16:08:59 -08:00
kailiu
63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Kai Liu
b2e7ee8b41 Followup code refactor on plain table
Summary:
Fixed most comments in https://reviews.facebook.net/D15429.
Still have some remaining comments left.

Test Plan: make all check

Reviewers: sdong, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15885
2014-02-13 15:27:59 -08:00
Siying Dong
f3ae3d07cc Add more black-box tests for PlainTable and explicitly support total order mode
Summary:
1. Add some more implementation-aware tests for PlainTable
2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0  means binary search only. Also fixes some divide 0 risks.
3. Explicitly support total order (only use binary search)
4. some code cleaning up.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16023
2014-02-12 17:37:22 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
Igor Canadi
ca5f1a225a CompactionContext to include is_manual_compaction
Summary: Added a bit more information to compaction context, requested by internal team at FB.

Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16095
2014-02-12 12:24:18 -08:00
Lei Jin
994c327b86 IOError cleanup
Summary: Clean up IOErrors so that it only indicates errors talking to device.

Test Plan: make all check

Reviewers: igor, haobo, dhruba, emayanke

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15831
2014-02-12 11:42:54 -08:00
Lei Jin
5fbf2ef42d preload table handle on Recover() when max_open_files == -1
Summary: This covers existing table files before DB open happens and avoids contention on table cache

Test Plan: db_test

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16089
2014-02-12 10:43:27 -08:00
Lei Jin
28b7f7faa8 enable plain table in db_bench
Summary: as title

Test Plan: ran db_bench to gather stats

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16059
2014-02-12 10:41:55 -08:00
kailiu
265150cb49 Fix problem 3 for issue #80 2014-02-11 17:52:18 -08:00
Siying Dong
33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Kai Liu
d4b789fdee Add LIBRARY back to make dbg 2014-02-10 20:15:09 -08:00
Igor Canadi
8e634d3ea4 Merge pull request #74 from alberts/lz4
Support for LZ4 compression.
2014-02-10 15:46:56 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
kailiu
161ab42a8a Make table properties shareable
Summary:
We are going to expose properties of all tables to end users through "some" db interface.
However, current design doesn't naturally fit for this need, which is because:

1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version).
2. Copy table properties is OK, but it's slow.

Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce.

Test Plan: `make check` passed

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15999
2014-02-07 19:26:49 -08:00
Igor Canadi
8d4db63a2d [CF] OpenWithColumnFamilies -> Open
Summary: By discussion with @dhruba, overloading Open makes more sense

Test Plan: compiles!

Reviewers: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16017
2014-02-07 14:49:33 -08:00
Igor Canadi
99e61fdd5c [CF] Separate dumping of DBOptions and ColumnFamilyOptions
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.

Test Plan: make check, confirm contents of the LOG

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16011
2014-02-07 13:52:51 -08:00
Igor Canadi
2b8c44639a Merge branch 'master' into columnfamilies 2014-02-07 11:42:56 -08:00
Igor Canadi
1560bb913e Readrandom with tailing iterator
Summary:
Added an option for readrandom benchmark to run with tailing iterator instead of Get. Benefit of tailing iterator is that it doesn't require locking DB mutex on access.

I also have some results when running on my machine. The results highly depend on number of cache shards. With our current benchmark setting of 4 table cache shards and 6 block cache shards, I don't see much improvements of using tailing iterator. In that case, we're probably seeing cache mutex contention.

Here are the results for different number of shards

    cache shards       tailing iterator        get
       6                      1.38M           1.16M
      10                      1.58M           1.15M

As soon as we get rid of cache mutex contention, we're seeing big improvements in using tailing iterator vs. ordinary get.

Test Plan: ran regression test

Reviewers: dhruba, haobo, ljin, kailiu, sding

Reviewed By: haobo

CC: tnovak

Differential Revision: https://reviews.facebook.net/D15867
2014-02-07 09:47:47 -08:00
Igor Canadi
0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi
0b4ccf765c Flushes should always go to HIGH priority thread pool
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).

It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15987
2014-02-06 14:17:13 -08:00
Igor Canadi
f8d5443efe [CF] Thread-safety guarantees for ColumnFamilySet
Summary: Revised thread-safety guarantees and implemented a way to spinlock the object.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15975
2014-02-06 11:57:36 -08:00
Igor Canadi
8fa8a708ef [CF] Propagate correct options to WriteBatch::InsertInto
Summary:
WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family.

This required a bit more acrobatics since lots of interfaces had to be changed.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15957
2014-02-06 10:23:31 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Igor Canadi
b4f441f48a Fixed a bug introduced by previous commit 2014-02-05 14:58:24 -08:00
Igor Canadi
f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi
6e56ab5702 [CF] Add full_options_ to ColumnFamilyData
Summary:
Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack.

The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family)

Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15927
2014-02-05 12:26:40 -08:00
Igor Canadi
328ac7ee02 Merge branch 'master' into columnfamilies 2014-02-05 12:00:33 -08:00
Igor Canadi
c24d8c4e90 [CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.

To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.

This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 11:55:30 -08:00
Igor Canadi
7b9f134959 [CF] Move InternalStats to ColumnFamilyData
Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15879
2014-02-04 17:59:50 -08:00
Igor Canadi
73f62255c1 [CF] Split SanitizeOptions into two
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)

I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.

Test Plan: make check doesn't complain

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15873
2014-02-04 17:26:51 -08:00
kailiu
d43ebd8c65 Put table factory back to public api
Summary:
Previous I am too ambitious to hide every detail about table factory
to internal api. However, we cannot pass the compilatoin for external
users since we use table factory as the shared_ptr, which requires
the definition of table factory's destructor.

Test Plan: make check;

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15861
2014-02-03 19:51:20 -08:00
Igor Canadi
5e2c4fe766 Get rid of DBImpl::user_comparator()
Summary: user_comparator() is a Column Family property, not DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15855
2014-02-03 17:50:47 -08:00
Igor Canadi
0e22badc08 [column families] Iterator and MultiGet
Summary: Support for different column families in Iterator and MultiGet code path.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15849
2014-02-03 17:44:40 -08:00
Igor Canadi
2966d764cd Fix some 32-bit compile errors
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697

Test Plan: RocksDB still compiles on 64-bit :)

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15825
2014-02-03 13:48:30 -08:00
Igor Canadi
2a9271b403 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
2014-02-03 13:47:54 -08:00
Lei Jin
5b3b6549d6 use super_version in NewIterator() and MultiGet() function
Summary:
Use super_version insider NewIterator to avoid Ref() each component
separately under mutex
The new added bench shows NewIterator QPS increases from 515K to 719K
No meaningful improvement for multiget I guess due to its relatively small
cost comparing to 90 keys fetch in the test.

Test Plan: unit test and db_bench

Reviewers: igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D15609
2014-02-03 13:13:36 -08:00
Igor Canadi
29bacb2eb6 VersionSet cleanup
Summary:
Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance)
Unfriended VersionSet and ColumnFamilyData (yay!)
Removed VersionSet::NumberLevels()
Cleaned up DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15819
2014-02-03 13:10:47 -08:00
Siying Dong
d169b67680 [Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte.
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489
2014-02-03 12:19:30 -08:00
Igor Canadi
5c6ef56152 Fix printf format 2014-02-03 10:25:37 -08:00
kailiu
4f6cb17bdb First phase API clean up
Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525
2014-02-03 00:30:43 -08:00
Igor Canadi
27a8856c23 Compacting column families
Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange()

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15813
2014-01-31 19:54:03 -08:00
Igor Canadi
5661ed8b80 Fix reduce_levels_test 2014-01-31 19:44:48 -08:00
Igor Canadi
fe9bd300d6 Merge branch 'master' into columnfamilies 2014-01-31 17:17:22 -08:00
Dhruba Borthakur
30a700657d Fix corruption_test failure caused by auto-enablement of checksum verification.
Summary:
Patch  https://reviews.facebook.net/D15591 enabled checksum
verification by default. This caused the unit test to fail.

Test Plan: ./corruption_test

Reviewers: igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15795
2014-01-31 17:16:38 -08:00
Igor Canadi
f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
Igor Canadi
5693db2a02 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-01-31 14:45:15 -08:00
Igor Canadi
dbbffbd772 Mark the log_number file number used
Summary:
VersionSet::next_file_number_ is always assumed to be strictly greater than VersionSet::log_number_. In our new recovery code, we artificially set log_number_  to be (log_number + 1), so that once we flush, we don't recover from the same log file again (this is important because of merge operator non-idempotence)

When we set VersionSet::log_number_ to (log_number + 1), we also have to mark that file number used, such that next_file_number_ is increased to a legal level. Otherwise, VersionSet might assert.

This has not be a problem so far because here's what happens:
1. assume next_file_number is 5, we're recovering log_number 10
2. in DBImpl::Recover() we call MarkFileNumberUsed with 10. This will set VersionSet::next_file_number_ to 11.
3. If there are some updates, we will call WriteTable0ForRecovery(), which will use file number 11 as a new table file and advance VersionSet::next_file_number_ to 12.
4. When we LogAndApply() with log_number 11, assertion is true: assert(11 <= 12);

However, this was a lucky occurrence. Even though this diff doesn't cause a bug, I think the issue is important to fix.

Test Plan: In column families I have different recovery logic and this code path asserted. When adding MarkFileNumberUsed(log_number + 1) assert is gone.

Reviewers: dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15783
2014-01-31 14:43:16 -08:00
Igor Canadi
3615f534d1 Enable flushing memtables from arbitrary column families
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!

Test Plan: Added a new unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15789
2014-01-31 14:42:52 -08:00
Siying Dong
56bea9f80d When using Universal Compaction, Zero out seqID in the last file too
Summary: I didn't figure out the reason why the feature of zeroing out earlier sequence ID is disabled in universal compaction. I do see bottommost_level is set correctly. It should simply work if we remove the constraint of universal compaction.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15423
2014-01-31 09:59:52 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Dhruba Borthakur
abd70ecc2b The default settings enable checksum verification on every read.
Summary: The default settings enable checksum verification on every read.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15591
2014-01-30 19:14:03 -08:00
Igor Canadi
9ca638a86d Enable iterating column families with a concurrent writer
Summary:
Sometimes we iterate through column families, and unlock the mutex in the body of the iteration. While mutex is unlocked, some column family might be created or dropped. We need to be able to continue iterating through column families even though our current column family got dropped.

This diff implements circular linked lists that connect all column families. It then uses the link list to enable iterating through linked lists. Even if the column family is dropped, its next_ pointer still can be used to advance to another alive column family.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15603
2014-01-30 16:57:52 -08:00
Igor Canadi
6973bb1722 MakeRoomForWrite() support for column families
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.

Test Plan: make check does not complain

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15597
2014-01-30 16:12:08 -08:00
Igor Canadi
c37e7de669 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-01-30 11:47:23 -08:00
Igor Canadi
ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
Igor Canadi
3c0dcf0e25 InternalStatistics
Summary:
In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste.

Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15273
2014-01-29 20:40:41 -08:00
Lei Jin
d118707f8d set bg_error_ when background flush goes wrong
Summary: as title

Test Plan: unit test

Reviewers: haobo, igor, sdong, kailiu, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15435
2014-01-29 15:55:58 -08:00
Igor Canadi
514e42c7cc Fix some lint warnings 2014-01-29 15:27:27 -08:00
Igor Canadi
fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Igor Canadi
15999e728e Fix column family test (create directory) 2014-01-29 14:06:59 -08:00
Igor Canadi
4662969bf5 PurgeObsoleteFiles in DropColumnFamily
Summary: When we drop the column family, we want to delete all the files from that column family.

Test Plan: make check

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15561
2014-01-29 11:58:33 -08:00
Igor Canadi
20b231d712 Merge branch 'master' into columnfamilies 2014-01-29 11:47:43 -08:00
Igor Canadi
f24a3ee52d Read from and write to different column families
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()

Test Plan: Added a unit test in column_family_test

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15537
2014-01-29 11:38:16 -08:00
Igor Canadi
e57f0cc1a1 add include <atomic> to version_set.h 2014-01-29 08:17:43 -08:00
Igor Canadi
c1071ed95c Merge branch 'master' into columnfamilies 2014-01-28 16:04:00 -08:00
Igor Canadi
5d2c62822e Only get the manifest file size if there is no error
Summary:
I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry.

@kailiu, can we get this in release, too?

Test Plan: make check

Reviewers: kailiu, dhruba, haobo

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D15513
2014-01-28 16:02:51 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Mark Callaghan
90f29ccbef Update monitoring to include average time per compaction and stall
Summary:
The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds.
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count   msComp   msStall  Ln-stall Stall-cnt
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        8       15   1.5         2         0        30         0         0        30        0.0       0.0        15.5        0        0        0        0       16      112       0.2       1.3      7568
  1        8       16   1.6         1        26        26        15        11        16        3.5      17.6        18.1        8        6       13        7        3      362       0.0       0.0         0
  2        1        2   0.0         0         0         2         0         0         2        0.0       0.0        18.4        0        0        0        0        1       50       0.0       0.0         0

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15345
2014-01-27 15:06:04 -08:00
Schalk-Willem Kruger
3d33da75ef Fix UnmarkEOF for partial blocks
Summary:
Blocks in the transaction log are a fixed size, but the last block in the transaction log file is usually a partial block. When a new record is added after the reader hit the end of the file, a new physical record will be appended to the last block. ReadPhysicalRecord can only read full blocks and assumes that the file position indicator is aligned to the start of a block. If the reader is forced to read further by simply clearing the EOF flag, ReadPhysicalRecord will read a full block starting from somewhere in the middle of a real block, causing it to lose alignment and to have a partial physical record at the end of the read buffer. This will result in length mismatches and checksum failures. When the log file is tailed for replication this will cause the log iterator to become invalid, necessitating the creation of a new iterator which will have to read the log file from scratch.

This diff fixes this issue by reading the remaining portion of the last block we read from. This is done when the reader is forced to read further (UnmarkEOF is called).

Test Plan:
- Added unit tests
- Stress test (with replication). Check dbdir/LOG file for corruptions.
- Test on test tier

Reviewers: emayanke, haobo, dhruba

Reviewed By: haobo

CC: vamsi, sheki, dhruba, kailiu, igor

Differential Revision: https://reviews.facebook.net/D15249
2014-01-27 14:49:10 -08:00
Igor Canadi
511b03a5b5 LogAndApply to take ColumnFamilyData
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.

Test Plan: make check

Reviewers: dhruba, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15465
2014-01-27 13:57:58 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Siying Dong
b20486f294 [Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15393
2014-01-27 10:26:14 -08:00
Igor Canadi
cf783c674a Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.h
2014-01-27 10:00:24 -08:00
Igor Canadi
6c2ca1d3e6 Move NeedsCompaction() from VersionSet to Version
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.

Test Plan: make check

Reviewers: kailiu, haobo, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15333
2014-01-27 09:59:00 -08:00
Igor Canadi
e55b3c040c Fixing ref-counting memtables 2014-01-26 18:03:38 -08:00
Mike Lin
af7838de36 address code review comments on 5e3aeb5f8e
- reduce string copying in Compaction::Summary
- simplify file number checking in UniversalCompactionStopStyleSimilarSize unit test
2014-01-25 14:12:24 -08:00
Igor Canadi
983fafa56c Fix memory leak 2014-01-25 13:57:11 -08:00
Igor Canadi
68a91a2e6a missing include 2014-01-24 18:40:05 -08:00
Igor Canadi
5356b2a680 Merge branch 'master' into columnfamilies 2014-01-24 18:34:48 -08:00
Igor Canadi
04afa32134 Fix reduce levels
ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we
didn't change the number of levels in VersionSet (we consider them
immutable from now on). This fixes the problem.
2014-01-24 18:32:38 -08:00
Siying Dong
8477255da3 Moving Some includes from options.h to forward declaration
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.

Test Plan: make all check

Reviewers: kailiu, haobo, igor, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15411
2014-01-24 17:16:22 -08:00
Igor Canadi
6a404de4ba Merge branch 'master' into columnfamilies 2014-01-24 15:54:18 -08:00
Igor Canadi
f653fdcf5a Fixing iterator cleanup for Tailing iterator
Immutable tailing iterator doesn't set CleanupState::mem, so we don't
have to unref it.
2014-01-24 15:51:06 -08:00
Igor Canadi
1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi
677fee27c6 Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.

Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.

Test Plan: reduce_levels_test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15303
2014-01-24 14:57:04 -08:00
Igor Canadi
c583157d49 MemTableListVersion
Summary:
MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful.

This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team.

I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet())

Test Plan: `make check` works. I will also do `make valgrind_check` before commit

Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15255
2014-01-24 14:52:08 -08:00
Igor Canadi
e832e72b31 Revert "Moving to glibc-fb"
This reverts commit d24961b65e.

For some reason, glibc2.17-fb breaks gflags. Reverting for now
2014-01-24 11:50:38 -08:00
kailiu
66dc033af3 Temporarily disable caching index/filter blocks
Summary:
Mixing index/filter blocks with data blocks resulted in some known
issues.  To make sure in next release our users won't be affected,
we added a new option in BlockBasedTableFactory::TableOption to
conceal this functionality for now.

This patch also introduced a BlockBasedTableReader::OpenOptions,
which avoids the "infinite" growth of parameters in
BlockBasedTableReader::Open().

Test Plan: make check

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: igor

CC: leveldb, tnovak

Differential Revision: https://reviews.facebook.net/D15327
2014-01-24 10:57:15 -08:00
Igor Canadi
d24961b65e Moving to glibc-fb
Summary:
It looks like we might have some trouble when building the new release with 4.8, since fbcode is using glibc2.17-fb by default and we are using glibc2.17. It was reported by Benjamin Renard in our internal group.

This diff moves our fbcode build to use glibc2.17-fb by default. I got some linker errors when compiling, complaining that `google::SetUsageMessage()` was undefined. After deleting all offending lines, the compile was successful and everything works.

Test Plan:
Compiled
Ran ./db_bench ./db_stress ./db_repl_stress

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15405
2014-01-24 10:24:08 -08:00
Siying Dong
4605e20c58 If User setting of compaction multipliers overflow, use default value 1 instead
Summary: Currently, compaction multipliers can overflow and cause unexpected behaviors. In this patch, we detect those overflows and use multiplier 1 for them.

Test Plan: make all check

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15321
2014-01-24 10:14:23 -08:00
Igor Canadi
28d1a0c6f5 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.h
	db/db_test.cc
	include/rocksdb/db.h
	include/utilities/stackable_db.h
2014-01-24 09:27:29 -08:00
Igor Canadi
09489d395f Fix a bug in DBImpl::CreateColumnFamily 2014-01-24 09:12:00 -08:00
Lei Jin
aba2acb5ec CompactRange() to return status
Summary: as title

Test Plan:
make all check
What else tests shall I cover?

Reviewers: igor, haobo

CC:

Differential Revision: https://reviews.facebook.net/D15339
2014-01-23 16:41:46 -08:00
Kai Liu
054c5dda8c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
	util/statistics_imp.h
2014-01-23 16:32:49 -08:00
Tomislav Novak
81c9cc9b3b Tailing iterator
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.

TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).

Test Plan:
* new unit tests
* running LD with this patch

Reviewers: igor, dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb, lovro, march

Differential Revision: https://reviews.facebook.net/D15285
2014-01-23 16:26:08 -08:00
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
f9a25dda9f Fix wrong merge 2014-01-22 11:30:19 -08:00
Igor Canadi
92a022ad07 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
	db/version_set.cc
2014-01-22 10:59:07 -08:00
Igor Canadi
fb01755aa4 Unfriending classes
Summary:
In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends.

I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :)

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15267
2014-01-22 10:55:16 -08:00
Igor Canadi
6fe9b57748 Refactor Recover() code
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered

The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.

[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15237
2014-01-22 10:45:26 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Siying Dong
7dea558e6d [Performance Branch] Fix a bug when merging from master
Summary: Commit "1304d8c8cefe66be1a3caa5e93413211ba2486f2" (Merge branch 'master' into performance) removes a line in performance branch by mistake. This patch fixes it.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15297
2014-01-21 12:44:43 -08:00
Mark Callaghan
4e8321bfea Boost access before mutex is unlocked
Summary:
This moves the use of versions_ to before the mutex is unlocked
to avoid a possible race.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15279
2014-01-17 21:32:23 -08:00
Kai Liu
ef602f6275 Misc cleanup on performance branch
Summary:

Did some trivial stuffs:

* Add more comments;
* fix compiler's warning messages (uninitialized variables).
* etc

Test Plan:

make check
2014-01-17 14:26:29 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
Igor Canadi
8079dd5d24 Merge branch 'master' into performance 2014-01-17 12:20:07 -08:00
Igor Canadi
0f4a75b710 Fix SIGSEGV in compaction picker
Summary:
The SIGSEGV was introduced by https://reviews.facebook.net/D15171

I also fixed ExpandWhileOverlapping() which returned the failure by setting it's own stack variable to nullptr (!). This bug is present in 2.6 release, so I guess ExpandWhileOverlapping never fails :)

Test Plan: `make check`. Also MarkCallaghan confirmed it fixed the SIGSEGV he reported.

Reviewers: MarkCallaghan, kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15261
2014-01-17 12:02:03 -08:00
Mark Callaghan
439e36db21 Fix SlowdownAmount
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall

With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.

With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown

Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15243
2014-01-17 10:15:30 -08:00
Mike Lin
5e3aeb5f8e An initial implementation of kCompactionStopStyleSimilarSize for universal compaction 2014-01-16 22:59:34 -08:00
Mike Lin
b1194f4903 Minor compaction logging improvements
1) make summary less likely to be truncated
2) format human-readable file sizes in summary
3) log the motivation for each universal compaction
2014-01-16 22:54:31 -08:00
Naman Gupta
1447bb5919 Allow callback to change size of existing value. Change return type of the callback function to an enum status to handle 3 cases.
Summary:
This diff fixes 2 hacks:
* The callback function can modify the existing value inplace, if the merged value fits within the existing buffer size. But currently the existing buffer size is not being modified. Now the callback recieves a int* allowing the size to be modified. Since size is encoded as a varint in the internal key for memtable. It might happen that the entire value might have be copied to the new location if the new size varint is smaller than the existing size varint.
* The callback function has 3 functionalities
    1. Modify existing buffer inplace, and update size correspondingly. Now to indicate that, Returns 1.
    2. Generate a new buffer indicating merged value. Returns 2.
    3. Fails to do either of above, based on whatever application logic. Returns 0.

Test Plan: Just make all for now. I'm adding another unit test to test each scenario.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb, sdong, kailiu, xinyaohu, sumeet, danguo

Differential Revision: https://reviews.facebook.net/D15195
2014-01-16 15:12:39 -08:00
Kai Liu
d4f65f1683 Merge branch 'master' into performance
This patch merges master's changes on build_tools/format-diff.sh.
Conflicts:
	db/version_edit.cc
2014-01-16 14:31:18 -08:00
Igor Canadi
6d6fb70960 Remove compaction pointers
Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15225
2014-01-16 14:06:53 -08:00
Igor Canadi
c699c84af4 CompactionPicker
Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.

To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.

In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.

This diff depends on D15171, D15183, D15189 and D15201

Test Plan: `make check`

Reviewers: kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15207
2014-01-16 13:03:52 -08:00
kailiu
1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
kailiu
eae1804f29 Remove the unnecessary use of shared_ptr
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213
2014-01-15 18:22:01 -08:00
Igor Canadi
787f11bb3b Move more functions from VersionSet to Version
Summary:
This moves functions:
* VersionSet::Finalize() -> Version::UpdateCompactionStats()
* VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()

The diff depends on D15189, D15183 and D15171

Test Plan: make check

Reviewers: kailiu, sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15201
2014-01-15 16:23:36 -08:00
Igor Canadi
615d1ea2f4 Moving Compaction class to separate header file
Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.

The diff depends on D15171 and D15183

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15189
2014-01-15 16:22:34 -08:00
Igor Canadi
2f4eda7890 Move functions from VersionSet to Version
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183
2014-01-15 16:18:04 -08:00
Igor Canadi
65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
Siying Dong
9b51af5a17 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589

Conflicts:
	db/db_impl.cc
2014-01-14 17:41:44 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Igor Canadi
1ed2404f27 Wrong number of levels is Invalid argument now, not corruption 2014-01-14 15:54:11 -08:00
Igor Canadi
6291020284 Fix test 2014-01-14 15:41:30 -08:00
Igor Canadi
7f3e417f59 Fix memtable construction in tests 2014-01-14 15:36:12 -08:00
Igor Canadi
055e6df45b VersionEdit not to take NumLevels()
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.

Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.

This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15159
2014-01-14 15:27:09 -08:00
Igor Canadi
7d9f21cf23 BuildBatchGroup -- memcpy outside of lock
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.

Test Plan:
`make check`

I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D15063
2014-01-14 14:49:31 -08:00
Naman Gupta
1d9bac4d7f Use sanitized options while opening db
Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture.

Test Plan: make check

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14103
2014-01-14 11:46:24 -08:00
Siying Dong
9ea8bf90f1 DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 11:24:43 -08:00
Siying Dong
fbbf0d1456 Pre-calculate whether to slow down for too many level 0 files
Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141
2014-01-14 11:23:02 -08:00
Siying Dong
51dd21926c DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 10:53:16 -08:00
Naman Gupta
8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
Igor Canadi
a107691711 [column families] Implement refcounting ColumnFamilyData
Summary: We don't want to delete ColumnFamilyData object if somebody has references to it.

Test Plan: `make check` for now, but will need to implement bigger column family test case

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15111
2014-01-13 09:22:44 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Igor Canadi
d076cef347 [column families] Get rid of VersionSet::current_ and keep current Version for each column family
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.

I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15105
2014-01-13 08:59:15 -08:00
Igor Canadi
dd6ecdf342 Use ASSERT_EQ() instead of assert() in merge_test 2014-01-11 09:25:47 -08:00
Schalk-Willem Kruger
a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Siying Dong
aa0ef6602d [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039
2014-01-10 15:57:49 -08:00
Siying Dong
5b5ab0c1a8 [Performance Branch] Fix memory leak in HashLinkListRep.GetIterator()
Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it.

Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15093
2014-01-10 12:12:28 -08:00
Siying Dong
237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
Siying Dong
424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00
Siying Dong
5575316350 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703
2014-01-08 16:05:36 -08:00
Igor Canadi
19e3ee64ac Add column family information to WAL
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).

This endeavour required changing some WriteBatch internals.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15045
2014-01-08 12:53:33 -08:00
Mark Callaghan
50994bf699 Don't always compress L0 files written by memtable flush
Summary:
Code was always compressing L0 files written by a memtable flush
when compression was enabled. Now this is done when
min_level_to_compress=0 for leveled compaction and when
universal_compaction_size_percent=-1 for universal compaction.

Task ID: #3416472

Blame Rev:

Test Plan:
ran db_bench with compression options

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba, igor, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14757
2014-01-07 21:50:26 -08:00
Igor Canadi
a45b7d83ba Merge pull request #59 from mlin/more-c-bindings
C API: add rocksdb_env_set_high_priority_background_threads
2014-01-07 16:33:03 -08:00
Igor Canadi
72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi
d3a2ba9c64 Merge branch 'master' into columnfamilies 2014-01-07 11:05:03 -08:00
Igor Canadi
17a222670b Merge branch 'master' into performance 2014-01-07 11:04:21 -08:00
Tomislav Novak
9f690ec62c Fix a deadlock in CompactRange()
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):

   1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
      bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
      (bg_compaction_scheduled_ is now 10) and waits for it to complete.
   2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
      (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
      drop to LargeNumber.
   3. BG thread completes the first manual compaction, decrements
      bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
      Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
      (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
      thread #2 also goes to sleep (waiting on bg_cv_), without resetting
      bg_compaction_scheduled_.

This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).

Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.

Tests (make check) pass.

Reviewers: dhruba, igor, sdong, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14799
2014-01-07 10:37:34 -08:00
Igor Canadi
fff5c7e817 Merge branch 'master' into columnfamilies 2014-01-06 13:31:41 -08:00
Kai Liu
5e7d5629c7 Fix the valgrind issues 2014-01-03 11:48:31 -08:00
Igor Canadi
ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
Kai Liu
774ed89c24 Replace vector with autovector
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.

Test Plan:
make check

I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.

Reviewers: dhruba, haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14985
2014-01-02 16:43:35 -08:00
kailiu
e72aa37cc5 Merge branch 'master' into performance
Conflicts:
	db/table_cache.cc
2014-01-02 16:34:59 -08:00
kailiu
476416c27c Some minor refactoring on the code
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.

Test Plan: ran `make check`

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15009
2014-01-02 16:32:31 -08:00
kailiu
9281a826f1 Hotfix the bug in table cache's GetSliceForFileNumber
Forgot to fix this problem in master branch. Already fixed it in performance branch.
2014-01-02 10:30:42 -08:00
Igor Canadi
7535443083 [RocksDB] Support for column families in manifest
Summary:
<This diff is for Column Family branch>

Added fields in manifest file to support adding and deleting column families.

Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14733
2014-01-02 04:18:28 -08:00
Igor Canadi
6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
Igor Canadi
b60c14f6ee Support multi-threaded DisableFileDeletions() and EnableFileDeletions()
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().

However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.

Test Plan: make check

Reviewers: dhruba, haobo, sanketh

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14781
2014-01-02 03:33:42 -08:00
Mike Lin
4b1d049236 C API: add rocksdb_env_set_high_priority_background_threads 2013-12-31 15:14:18 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Siying Dong
a094f3b3b5 TableCache.FindTable() to avoid the mem copy of file number
Summary: I'm not sure what's the purpose of encoding file number to a new buffer for looking up the table cache. It seems to be unnecessary to me. With this patch, we point the lookup key to the address of the int64 of the file number.

Test Plan: make all check

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14811
2013-12-26 16:57:07 -08:00
Siying Dong
18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
kailiu
079a21ba99 Fix the unused variable warning message in mac os 2013-12-26 15:12:30 -08:00
Haobo Xu
bf4a48ccb3 [RocksDB] [Performance Branch] Revert previous patch.
Summary: The previous patch is wrong. rep_.resize(kHeader) just resets the header portion to zero, and should not cause a re-allocation if g++ does it right. I will go ahead and revert it.

Test Plan: make check

Reviewers: dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14793
2013-12-20 18:20:06 -08:00
Haobo Xu
e94eea4527 [RocksDB] [Performance Branch] Minor fix, Remove string resize from WriteBatch::Clear
Summary: tmp_batch_ will get re-allocated for every merged write batch because of the existing resize in WriteBatch::Clear. Note that in DBImpl::BuildBatchGroup, we have a hard coded upper limit of batch size 1<<20 = 1MB already.

Test Plan: make check

Reviewers: dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14787
2013-12-20 16:29:05 -08:00
Siying Dong
abaf26266d [RocksDB] [Performance Branch] Some Changes to PlainTable format
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test

Test Plan: test db_test

Reviewers: haobo, kailiu

CC: dhruba, igor, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D14457
2013-12-20 12:08:35 -08:00
Igor Canadi
1fdb3f7dc6 [RocksDB] Optimize locking for Get
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679
2013-12-20 09:57:58 -08:00
Mark Callaghan
ca92068b12 Add 'readtocache' test
Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
  db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14739
2013-12-18 16:54:53 -08:00
Igor Canadi
269709a885 Merge branch 'master' into columnfamilies 2013-12-18 15:56:37 -08:00
Igor Canadi
3b50b6213d Merge pull request #37 from mlin/more-c-bindings
C bindings: add a bunch of the newer options
2013-12-18 13:12:04 -08:00
Igor Canadi
9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
Siying Dong
14995a8ff3 Move level0 sorting logic from Version::SaveTo() to Version::Finalize()
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.

Test Plan: make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14607
2013-12-17 18:06:58 -08:00
Siying Dong
a8b8b11dc4 Get() Does Not Reserve space for to_delete memtables
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14697
2013-12-17 17:16:16 -08:00
Mike Lin
2a2506b629 C bindings: add a bunch of the newer options 2013-12-15 13:47:06 -08:00
Kai Liu
2e9efcd6d8 Add the property block for the plain table
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format

  [data block]
  [meta block 1: stats block]
  [meta block 2: future extended block]
  ...
  [meta block K: future extended block]  (we may add more meta blocks in the future)
  [metaindex block]
  [index block: we only have the placeholder here, we can add persistent index block in the future]
  [Footer: contains magic number, handle to metaindex block and index block]
  <end_of_file>

Test Plan: extended existing property block test.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14523
2013-12-13 17:18:14 -08:00
kailiu
0cd1521af5 Completely remove argv_ since no one use it
There are still warning in some other environment, just move that useless variable `argv_`
2013-12-12 16:36:38 -08:00
kailiu
0e24f97b9f Revert last commit and add "unused" attribute to suppress warning 2013-12-12 15:40:44 -08:00
kailiu
bc9b488e92 fix a warning in db_test when running make release 2013-12-12 15:35:02 -08:00
Mark Callaghan
e9e6b00d29 Add monitoring for universal compaction and add counters for compaction IO
Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583
2013-12-12 13:27:43 -08:00
Siying Dong
e8ab1934d9 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589
2013-12-12 11:30:00 -08:00
Siying Dong
aaf9c6203c [RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14595
2013-12-11 13:59:43 -08:00
Siying Dong
a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
Siying Dong
bc5dd19b14 [RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409
2013-12-11 10:50:09 -08:00
Siying Dong
41349d9ef1 [RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409
2013-12-11 10:49:49 -08:00
Siying Dong
0304e3d2ff When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577

Conflicts:
	db/db_impl.cc
2013-12-11 10:02:17 -08:00
Siying Dong
95a411d853 When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577
2013-12-11 09:57:19 -08:00
Haobo Xu
3c02c363b3 [RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering
Summary: as title

Test Plan: dynamic_bloom_test

Reviewers: dhruba, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14385
2013-12-11 00:15:14 -08:00
kailiu
a82f42b765 rename db/memtablelist.{h,cc} 2013-12-10 19:03:13 -08:00
Igor Canadi
204bb9cffd Get rid of LogFlush() in InternalIterator 2013-12-10 10:59:00 -08:00
Igor Canadi
19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00