rocksdb

Author	SHA1	Message	Date
Igor Canadi	f8d5443efe	[CF] Thread-safety guarantees for ColumnFamilySet Summary: Revised thread-safety guarantees and implemented a way to spinlock the object. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15975	2014-02-06 11:57:36 -08:00
Igor Canadi	8fa8a708ef	[CF] Propagate correct options to WriteBatch::InsertInto Summary: WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family. This required a bit more acrobatics since lots of interfaces had to be changed. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15957	2014-02-06 10:23:31 -08:00
kailiu	84f8185fc0	Merge branch 'master' into performance Conflicts: HISTORY.md db/db_impl.cc db/memtable.cc	2014-02-05 21:21:00 -08:00
Igor Canadi	b4f441f48a	Fixed a bug introduced by previous commit	2014-02-05 14:58:24 -08:00
Igor Canadi	f276e0e59d	[CF] Options -> DBOptions Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15933	2014-02-05 14:56:09 -08:00
Igor Canadi	6e56ab5702	[CF] Add full_options_ to ColumnFamilyData Summary: Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack. The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family) Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15927	2014-02-05 12:26:40 -08:00
Igor Canadi	328ac7ee02	Merge branch 'master' into columnfamilies	2014-02-05 12:00:33 -08:00
Igor Canadi	c24d8c4e90	[CF] Rethink table cache Summary: Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory. To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects. This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15915	2014-02-05 11:55:30 -08:00
Igor Canadi	7b9f134959	[CF] Move InternalStats to ColumnFamilyData Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15879	2014-02-04 17:59:50 -08:00
Igor Canadi	73f62255c1	[CF] Split SanitizeOptions into two Summary: There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two) I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example. Test Plan: make check doesn't complain Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15873	2014-02-04 17:26:51 -08:00
kailiu	d43ebd8c65	Put table factory back to public api Summary: Previous I am too ambitious to hide every detail about table factory to internal api. However, we cannot pass the compilatoin for external users since we use table factory as the shared_ptr, which requires the definition of table factory's destructor. Test Plan: make check; Reviewers: sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15861	2014-02-03 19:51:20 -08:00
Igor Canadi	5e2c4fe766	Get rid of DBImpl::user_comparator() Summary: user_comparator() is a Column Family property, not DBImpl Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15855	2014-02-03 17:50:47 -08:00
Igor Canadi	0e22badc08	[column families] Iterator and MultiGet Summary: Support for different column families in Iterator and MultiGet code path. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15849	2014-02-03 17:44:40 -08:00
Igor Canadi	2966d764cd	Fix some 32-bit compile errors Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697 Test Plan: RocksDB still compiles on 64-bit :) Reviewers: kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15825	2014-02-03 13:48:30 -08:00
Igor Canadi	2a9271b403	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.cc	2014-02-03 13:47:54 -08:00
Lei Jin	5b3b6549d6	use super_version in NewIterator() and MultiGet() function Summary: Use super_version insider NewIterator to avoid Ref() each component separately under mutex The new added bench shows NewIterator QPS increases from 515K to 719K No meaningful improvement for multiget I guess due to its relatively small cost comparing to 90 keys fetch in the test. Test Plan: unit test and db_bench Reviewers: igor, sdong Reviewed By: igor CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D15609	2014-02-03 13:13:36 -08:00
Igor Canadi	29bacb2eb6	VersionSet cleanup Summary: Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance) Unfriended VersionSet and ColumnFamilyData (yay!) Removed VersionSet::NumberLevels() Cleaned up DBImpl Test Plan: make check Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15819	2014-02-03 13:10:47 -08:00
Siying Dong	d169b67680	[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15489	2014-02-03 12:19:30 -08:00
Igor Canadi	5c6ef56152	Fix printf format	2014-02-03 10:25:37 -08:00
kailiu	4f6cb17bdb	First phase API clean up Summary: Addressed all the issues in https://reviews.facebook.net/D15447. Now most table-related modules are hidden from user land. Test Plan: make check Reviewers: sdong, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15525	2014-02-03 00:30:43 -08:00
Igor Canadi	27a8856c23	Compacting column families Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange() Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15813	2014-01-31 19:54:03 -08:00
Igor Canadi	5661ed8b80	Fix reduce_levels_test	2014-01-31 19:44:48 -08:00
Igor Canadi	fe9bd300d6	Merge branch 'master' into columnfamilies	2014-01-31 17:17:22 -08:00
Dhruba Borthakur	30a700657d	Fix corruption_test failure caused by auto-enablement of checksum verification. Summary: Patch https://reviews.facebook.net/D15591 enabled checksum verification by default. This caused the unit test to fail. Test Plan: ./corruption_test Reviewers: igor, kailiu Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D15795	2014-01-31 17:16:38 -08:00
Igor Canadi	f7489123e2	Move compaction picker and internal key comparator to ColumnFamilyData Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15801	2014-01-31 16:06:55 -08:00
Igor Canadi	5693db2a02	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc	2014-01-31 14:45:15 -08:00
Igor Canadi	dbbffbd772	Mark the log_number file number used Summary: VersionSet::next_file_number_ is always assumed to be strictly greater than VersionSet::log_number_. In our new recovery code, we artificially set log_number_ to be (log_number + 1), so that once we flush, we don't recover from the same log file again (this is important because of merge operator non-idempotence) When we set VersionSet::log_number_ to (log_number + 1), we also have to mark that file number used, such that next_file_number_ is increased to a legal level. Otherwise, VersionSet might assert. This has not be a problem so far because here's what happens: 1. assume next_file_number is 5, we're recovering log_number 10 2. in DBImpl::Recover() we call MarkFileNumberUsed with 10. This will set VersionSet::next_file_number_ to 11. 3. If there are some updates, we will call WriteTable0ForRecovery(), which will use file number 11 as a new table file and advance VersionSet::next_file_number_ to 12. 4. When we LogAndApply() with log_number 11, assertion is true: assert(11 <= 12); However, this was a lucky occurrence. Even though this diff doesn't cause a bug, I think the issue is important to fix. Test Plan: In column families I have different recovery logic and this code path asserted. When adding MarkFileNumberUsed(log_number + 1) assert is gone. Reviewers: dhruba, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15783	2014-01-31 14:43:16 -08:00
Igor Canadi	3615f534d1	Enable flushing memtables from arbitrary column families Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families! Test Plan: Added a new unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15789	2014-01-31 14:42:52 -08:00
Siying Dong	56bea9f80d	When using Universal Compaction, Zero out seqID in the last file too Summary: I didn't figure out the reason why the feature of zeroing out earlier sequence ID is disabled in universal compaction. I do see bottommost_level is set correctly. It should simply work if we remove the constraint of universal compaction. Test Plan: make all check Reviewers: haobo, dhruba, kailiu, igor Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15423	2014-01-31 09:59:52 -08:00
kailiu	4e0298f23c	Clean up arena API Summary: Easy thing goes first. This patch moves arena to internal dir; based on which, the coming patch will deal with memtable_rep. Test Plan: make check Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15615	2014-01-30 22:10:10 -08:00
Dhruba Borthakur	abd70ecc2b	The default settings enable checksum verification on every read. Summary: The default settings enable checksum verification on every read. Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15591	2014-01-30 19:14:03 -08:00
Igor Canadi	9ca638a86d	Enable iterating column families with a concurrent writer Summary: Sometimes we iterate through column families, and unlock the mutex in the body of the iteration. While mutex is unlocked, some column family might be created or dropped. We need to be able to continue iterating through column families even though our current column family got dropped. This diff implements circular linked lists that connect all column families. It then uses the link list to enable iterating through linked lists. Even if the column family is dropped, its next_ pointer still can be used to advance to another alive column family. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15603	2014-01-30 16:57:52 -08:00
Igor Canadi	6973bb1722	MakeRoomForWrite() support for column families Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one. Test Plan: make check does not complain Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15597	2014-01-30 16:12:08 -08:00
Igor Canadi	c37e7de669	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h	2014-01-30 11:47:23 -08:00
Igor Canadi	ac92420fc5	Merge branch 'master' into performance Conflicts: db/db_impl.h	2014-01-30 10:09:23 -08:00
Igor Canadi	3c0dcf0e25	InternalStatistics Summary: In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste. Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15273	2014-01-29 20:40:41 -08:00
Lei Jin	d118707f8d	set bg_error_ when background flush goes wrong Summary: as title Test Plan: unit test Reviewers: haobo, igor, sdong, kailiu, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15435	2014-01-29 15:55:58 -08:00
Igor Canadi	514e42c7cc	Fix some lint warnings	2014-01-29 15:27:27 -08:00
Igor Canadi	fa99d53e55	Change ColumnFamilyData from struct to class Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15579	2014-01-29 15:18:36 -08:00
Igor Canadi	15999e728e	Fix column family test (create directory)	2014-01-29 14:06:59 -08:00
Igor Canadi	4662969bf5	PurgeObsoleteFiles in DropColumnFamily Summary: When we drop the column family, we want to delete all the files from that column family. Test Plan: make check Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15561	2014-01-29 11:58:33 -08:00
Igor Canadi	20b231d712	Merge branch 'master' into columnfamilies	2014-01-29 11:47:43 -08:00
Igor Canadi	f24a3ee52d	Read from and write to different column families Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber() Test Plan: Added a unit test in column_family_test Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15537	2014-01-29 11:38:16 -08:00
Igor Canadi	e57f0cc1a1	add include <atomic> to version_set.h	2014-01-29 08:17:43 -08:00
Igor Canadi	c1071ed95c	Merge branch 'master' into columnfamilies	2014-01-28 16:04:00 -08:00
Igor Canadi	5d2c62822e	Only get the manifest file size if there is no error Summary: I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry. @kailiu, can we get this in release, too? Test Plan: make check Reviewers: kailiu, dhruba, haobo Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D15513	2014-01-28 16:02:51 -08:00
kailiu	a5e220f5ef	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_test.cc db/memtable_list.cc db/memtable_list.h table/block_based_table_reader.cc table/table_test.cc util/cache.cc util/coding.cc	2014-01-28 10:35:55 -08:00
Igor Canadi	4bf25357ae	[column families] Removing VersionSet::current() Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15483	2014-01-27 17:04:46 -08:00
Mark Callaghan	90f29ccbef	Update monitoring to include average time per compaction and stall Summary: The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds. Level Files Size(MB) Score Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count msComp msStall Ln-stall Stall-cnt ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 0 8 15 1.5 2 0 30 0 0 30 0.0 0.0 15.5 0 0 0 0 16 112 0.2 1.3 7568 1 8 16 1.6 1 26 26 15 11 16 3.5 17.6 18.1 8 6 13 7 3 362 0.0 0.0 0 2 1 2 0.0 0 0 2 0 0 2 0.0 0.0 18.4 0 0 0 0 1 50 0.0 0.0 0 Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15345	2014-01-27 15:06:04 -08:00
Schalk-Willem Kruger	3d33da75ef	Fix UnmarkEOF for partial blocks Summary: Blocks in the transaction log are a fixed size, but the last block in the transaction log file is usually a partial block. When a new record is added after the reader hit the end of the file, a new physical record will be appended to the last block. ReadPhysicalRecord can only read full blocks and assumes that the file position indicator is aligned to the start of a block. If the reader is forced to read further by simply clearing the EOF flag, ReadPhysicalRecord will read a full block starting from somewhere in the middle of a real block, causing it to lose alignment and to have a partial physical record at the end of the read buffer. This will result in length mismatches and checksum failures. When the log file is tailed for replication this will cause the log iterator to become invalid, necessitating the creation of a new iterator which will have to read the log file from scratch. This diff fixes this issue by reading the remaining portion of the last block we read from. This is done when the reader is forced to read further (UnmarkEOF is called). Test Plan: - Added unit tests - Stress test (with replication). Check dbdir/LOG file for corruptions. - Test on test tier Reviewers: emayanke, haobo, dhruba Reviewed By: haobo CC: vamsi, sheki, dhruba, kailiu, igor Differential Revision: https://reviews.facebook.net/D15249	2014-01-27 14:49:10 -08:00
Igor Canadi	511b03a5b5	LogAndApply to take ColumnFamilyData Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting. Test Plan: make check Reviewers: dhruba, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15465	2014-01-27 13:57:58 -08:00
Igor Canadi	eb055609e4	[column families] Move memtable and immutable memtable list to column family data Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family. Test Plan: make check Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15459	2014-01-27 13:37:14 -08:00
Igor Canadi	ae16606f98	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set.h	2014-01-27 11:11:51 -08:00
Igor Canadi	832158e7f7	Fsync directory after we create a new file Summary: @dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder. Should I also add FsyncDir() to new files that get created by a compaction? Test Plan: Confirmed that FsyncDir is returning Status::OK() Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D14751	2014-01-27 11:02:21 -08:00
Siying Dong	b20486f294	[Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it. Test Plan: make all check Reviewers: haobo, kailiu Reviewed By: kailiu CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15393	2014-01-27 10:26:14 -08:00
Igor Canadi	cf783c674a	Merge branch 'master' into columnfamilies Conflicts: db/version_set.h	2014-01-27 10:00:24 -08:00
Igor Canadi	6c2ca1d3e6	Move NeedsCompaction() from VersionSet to Version Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet. Test Plan: make check Reviewers: kailiu, haobo, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15333	2014-01-27 09:59:00 -08:00
Igor Canadi	e55b3c040c	Fixing ref-counting memtables	2014-01-26 18:03:38 -08:00
Mike Lin	af7838de36	address code review comments on `5e3aeb5f8e` - reduce string copying in Compaction::Summary - simplify file number checking in UniversalCompactionStopStyleSimilarSize unit test	2014-01-25 14:12:24 -08:00
Igor Canadi	983fafa56c	Fix memory leak	2014-01-25 13:57:11 -08:00
Igor Canadi	68a91a2e6a	missing include	2014-01-24 18:40:05 -08:00
Igor Canadi	5356b2a680	Merge branch 'master' into columnfamilies	2014-01-24 18:34:48 -08:00
Igor Canadi	04afa32134	Fix reduce levels ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we didn't change the number of levels in VersionSet (we consider them immutable from now on). This fixes the problem.	2014-01-24 18:32:38 -08:00
Siying Dong	8477255da3	Moving Some includes from options.h to forward declaration Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies. Test Plan: make all check Reviewers: kailiu, haobo, igor, dhruba Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15411	2014-01-24 17:16:22 -08:00
Igor Canadi	6a404de4ba	Merge branch 'master' into columnfamilies	2014-01-24 15:54:18 -08:00
Igor Canadi	f653fdcf5a	Fixing iterator cleanup for Tailing iterator Immutable tailing iterator doesn't set CleanupState::mem, so we don't have to unref it.	2014-01-24 15:51:06 -08:00
Igor Canadi	1423e7c9de	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set_reduce_num_levels.cc util/ldb_cmd.cc	2014-01-24 15:03:54 -08:00
Igor Canadi	677fee27c6	Make VersionSet::ReduceNumberOfLevels() static Summary: A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method. Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file. Test Plan: reduce_levels_test Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15303	2014-01-24 14:57:04 -08:00
Igor Canadi	c583157d49	MemTableListVersion Summary: MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful. This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team. I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet()) Test Plan: `make check` works. I will also do `make valgrind_check` before commit Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15255	2014-01-24 14:52:08 -08:00
Igor Canadi	e832e72b31	Revert "Moving to glibc-fb" This reverts commit `d24961b65e`. For some reason, glibc2.17-fb breaks gflags. Reverting for now	2014-01-24 11:50:38 -08:00
kailiu	66dc033af3	Temporarily disable caching index/filter blocks Summary: Mixing index/filter blocks with data blocks resulted in some known issues. To make sure in next release our users won't be affected, we added a new option in BlockBasedTableFactory::TableOption to conceal this functionality for now. This patch also introduced a BlockBasedTableReader::OpenOptions, which avoids the "infinite" growth of parameters in BlockBasedTableReader::Open(). Test Plan: make check Reviewers: haobo, sdong, igor, dhruba Reviewed By: igor CC: leveldb, tnovak Differential Revision: https://reviews.facebook.net/D15327	2014-01-24 10:57:15 -08:00
Igor Canadi	d24961b65e	Moving to glibc-fb Summary: It looks like we might have some trouble when building the new release with 4.8, since fbcode is using glibc2.17-fb by default and we are using glibc2.17. It was reported by Benjamin Renard in our internal group. This diff moves our fbcode build to use glibc2.17-fb by default. I got some linker errors when compiling, complaining that `google::SetUsageMessage()` was undefined. After deleting all offending lines, the compile was successful and everything works. Test Plan: Compiled Ran ./db_bench ./db_stress ./db_repl_stress Reviewers: kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15405	2014-01-24 10:24:08 -08:00
Siying Dong	4605e20c58	If User setting of compaction multipliers overflow, use default value 1 instead Summary: Currently, compaction multipliers can overflow and cause unexpected behaviors. In this patch, we detect those overflows and use multiplier 1 for them. Test Plan: make all check Reviewers: dhruba, haobo, igor, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15321	2014-01-24 10:14:23 -08:00
Igor Canadi	28d1a0c6f5	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.h db/db_test.cc include/rocksdb/db.h include/utilities/stackable_db.h	2014-01-24 09:27:29 -08:00
Igor Canadi	09489d395f	Fix a bug in DBImpl::CreateColumnFamily	2014-01-24 09:12:00 -08:00
Lei Jin	aba2acb5ec	CompactRange() to return status Summary: as title Test Plan: make all check What else tests shall I cover? Reviewers: igor, haobo CC: Differential Revision: https://reviews.facebook.net/D15339	2014-01-23 16:41:46 -08:00
Kai Liu	054c5dda8c	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h util/statistics_imp.h	2014-01-23 16:32:49 -08:00
Tomislav Novak	81c9cc9b3b	Tailing iterator Summary: This diff implements a special type of iterator that doesn't create a snapshot (can be used to read newly inserted data) and is optimized for doing sequential reads. TailingIterator uses current superversion number to determine whether to invalidate its internal iterators. If the version hasn't changed, it can often avoid doing expensive seeks over immutable structures (sst files and immutable memtables). Test Plan: * new unit tests * running LD with this patch Reviewers: igor, dhruba, haobo, sdong, kailiu Reviewed By: sdong CC: leveldb, lovro, march Differential Revision: https://reviews.facebook.net/D15285	2014-01-23 16:26:08 -08:00
Igor Canadi	7c5e583a27	ColumnFamilySet Summary: I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner. Let me know if you have any comments. I will commit tomorrow. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15357	2014-01-23 14:03:38 -08:00
Igor Canadi	f9a25dda9f	Fix wrong merge	2014-01-22 11:30:19 -08:00
Igor Canadi	92a022ad07	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.cc db/version_set.cc	2014-01-22 10:59:07 -08:00
Igor Canadi	fb01755aa4	Unfriending classes Summary: In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends. I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :) Test Plan: make check Reviewers: haobo, kailiu, sdong, dhruba Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15267	2014-01-22 10:55:16 -08:00
Igor Canadi	6fe9b57748	Refactor Recover() code Summary: This diff does two things: * Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive. * Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change: 1. Recover log 5, add new sst files to edit 2. Recover log 7, add new sst files to edit 3. Recover log 8, add new sst files to edit 4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function) After the change, we'll do: 1. Recover log 5, commit the new sst files and set log 5 as recovered 2. Recover log 7, commit the new sst files and set log 7 as recovered 3. Recover log 8, commit the new sst files and set log 8 as recovered The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path. [1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15237	2014-01-22 10:45:26 -08:00
Igor Canadi	23f6791c9e	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl_readonly.cc db/db_test.cc db/version_edit.cc db/version_edit.h db/version_set.cc db/version_set.h db/version_set_reduce_num_levels.cc	2014-01-21 17:01:52 -08:00
Siying Dong	7dea558e6d	[Performance Branch] Fix a bug when merging from master Summary: Commit "1304d8c8cefe66be1a3caa5e93413211ba2486f2" (Merge branch 'master' into performance) removes a line in performance branch by mistake. This patch fixes it. Test Plan: make all check Reviewers: haobo, kailiu, igor Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15297	2014-01-21 12:44:43 -08:00
Mark Callaghan	4e8321bfea	Boost access before mutex is unlocked Summary: This moves the use of versions_ to before the mutex is unlocked to avoid a possible race. Task ID: # Blame Rev: Test Plan: make check Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15279	2014-01-17 21:32:23 -08:00
Kai Liu	ef602f6275	Misc cleanup on performance branch Summary: Did some trivial stuffs: * Add more comments; * fix compiler's warning messages (uninitialized variables). * etc Test Plan: make check	2014-01-17 14:26:29 -08:00
Igor Canadi	83681bf9ef	Statistics code cleanup Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review. Test Plan: make check Reviewers: haobo, kailiu, sdong, dhruba, tnovak Reviewed By: tnovak CC: leveldb Differential Revision: https://reviews.facebook.net/D15099	2014-01-17 12:46:06 -08:00
Igor Canadi	8079dd5d24	Merge branch 'master' into performance	2014-01-17 12:20:07 -08:00
Igor Canadi	0f4a75b710	Fix SIGSEGV in compaction picker Summary: The SIGSEGV was introduced by https://reviews.facebook.net/D15171 I also fixed ExpandWhileOverlapping() which returned the failure by setting it's own stack variable to nullptr (!). This bug is present in 2.6 release, so I guess ExpandWhileOverlapping never fails :) Test Plan: `make check`. Also MarkCallaghan confirmed it fixed the SIGSEGV he reported. Reviewers: MarkCallaghan, kailiu, sdong, dhruba, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15261	2014-01-17 12:02:03 -08:00
Mark Callaghan	439e36db21	Fix SlowdownAmount Summary: This had a few bugs. 1) bottom and top were reversed. top is for the max value but the callers were passing the max value to bottom. The result is that the max sleep is used when n >= bottom. 2) one of the callers passed values with type double and these values are frequently between 1.0 and 2.0 so rounding will do some bad things 3) sometimes the function returned 0 when there should be a stall With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload. With the fix. Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown Without the fix. Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown Task ID: # Blame Rev: Test Plan: run db_bench for --benchmarks=overwrite with IO-bound database Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15243	2014-01-17 10:15:30 -08:00
Mike Lin	5e3aeb5f8e	An initial implementation of kCompactionStopStyleSimilarSize for universal compaction	2014-01-16 22:59:34 -08:00
Mike Lin	b1194f4903	Minor compaction logging improvements 1) make summary less likely to be truncated 2) format human-readable file sizes in summary 3) log the motivation for each universal compaction	2014-01-16 22:54:31 -08:00
Naman Gupta	1447bb5919	Allow callback to change size of existing value. Change return type of the callback function to an enum status to handle 3 cases. Summary: This diff fixes 2 hacks: * The callback function can modify the existing value inplace, if the merged value fits within the existing buffer size. But currently the existing buffer size is not being modified. Now the callback recieves a int* allowing the size to be modified. Since size is encoded as a varint in the internal key for memtable. It might happen that the entire value might have be copied to the new location if the new size varint is smaller than the existing size varint. * The callback function has 3 functionalities 1. Modify existing buffer inplace, and update size correspondingly. Now to indicate that, Returns 1. 2. Generate a new buffer indicating merged value. Returns 2. 3. Fails to do either of above, based on whatever application logic. Returns 0. Test Plan: Just make all for now. I'm adding another unit test to test each scenario. Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb, sdong, kailiu, xinyaohu, sumeet, danguo Differential Revision: https://reviews.facebook.net/D15195	2014-01-16 15:12:39 -08:00
Kai Liu	d4f65f1683	Merge branch 'master' into performance This patch merges master's changes on build_tools/format-diff.sh. Conflicts: db/version_edit.cc	2014-01-16 14:31:18 -08:00
Igor Canadi	6d6fb70960	Remove compaction pointers Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15225	2014-01-16 14:06:53 -08:00
Igor Canadi	c699c84af4	CompactionPicker Summary: This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers. To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same. In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation. This diff depends on D15171, D15183, D15189 and D15201 Test Plan: `make check` Reviewers: kailiu, sdong, dhruba, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15207	2014-01-16 13:03:52 -08:00
kailiu	1304d8c8ce	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_impl.h db/db_test.cc db/memtable.cc db/memtable.h db/version_edit.h db/version_set.cc include/rocksdb/options.h util/hash_skiplist_rep.cc util/options.cc	2014-01-15 23:12:31 -08:00
kailiu	eae1804f29	Remove the unnecessary use of shared_ptr Summary: shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers). In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr. According to igor's previous work, we are likely to make quite some performance gain from this diff. Test Plan: make check Reviewers: dhruba, igor, sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15213	2014-01-15 18:22:01 -08:00
Igor Canadi	787f11bb3b	Move more functions from VersionSet to Version Summary: This moves functions: * VersionSet::Finalize() -> Version::UpdateCompactionStats() * VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize() The diff depends on D15189, D15183 and D15171 Test Plan: make check Reviewers: kailiu, sdong, haobo, dhruba Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15201	2014-01-15 16:23:36 -08:00
Igor Canadi	615d1ea2f4	Moving Compaction class to separate header file Summary: I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file. The diff depends on D15171 and D15183 Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15189	2014-01-15 16:22:34 -08:00
Igor Canadi	2f4eda7890	Move functions from VersionSet to Version Summary: There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier. The functions moved are: * NumLevelBytes * LevelSummary * LevelFileSummary * MaxNextLevelOverlappingBytes * AddLiveFiles (previously AddLiveFilesCurrentVersion()) * NeedSlowdownForNumLevel0Files The diff continues on (and depends on) D15171 Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong, emayanke Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15183	2014-01-15 16:18:04 -08:00
Igor Canadi	65a8a52b54	Decrease reliance on VersionSet::NumberLevels() Summary: With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels() This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet. I have also slightly changed how VersionSet keeps track of manifest size. This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor. In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_ Test Plan: make check Reviewers: haobo, dhruba, kailiu, sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D15171	2014-01-15 16:15:43 -08:00
Siying Dong	9b51af5a17	[RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get. Test Plan: make all check will run db_stress for a while too to make sure no problem. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14589 Conflicts: db/db_impl.cc	2014-01-14 17:41:44 -08:00
Igor Canadi	d9cd7a063f	Fix CompactRange to apply filter to every key Summary: When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`. This patch fixed the unit test. Test Plan: Added a failing unit test and a fix, so it's not failing anymore. Reviewers: dhruba, haobo, sdong Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D14421	2014-01-14 16:19:09 -08:00
Igor Canadi	1ed2404f27	Wrong number of levels is Invalid argument now, not corruption	2014-01-14 15:54:11 -08:00
Igor Canadi	6291020284	Fix test	2014-01-14 15:41:30 -08:00
Igor Canadi	7f3e417f59	Fix memtable construction in tests	2014-01-14 15:36:12 -08:00
Igor Canadi	055e6df45b	VersionEdit not to take NumLevels() Summary: I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible. Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them. This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15159	2014-01-14 15:27:09 -08:00
Igor Canadi	7d9f21cf23	BuildBatchGroup -- memcpy outside of lock Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held. Test Plan: `make check` I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them. Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D15063	2014-01-14 14:49:31 -08:00
Naman Gupta	1d9bac4d7f	Use sanitized options while opening db Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture. Test Plan: make check Reviewers: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14103	2014-01-14 11:46:24 -08:00
Siying Dong	9ea8bf90f1	DB::Put() to estimate write batch data size needed and pre-allocate buffer Summary: In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand. Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced. Test Plan: make all check Reviewers: haobo, kailiu, igor Reviewed By: haobo CC: leveldb, nkg- Differential Revision: https://reviews.facebook.net/D15135	2014-01-14 11:24:43 -08:00
Siying Dong	fbbf0d1456	Pre-calculate whether to slow down for too many level 0 files Summary: Currently in DBImpl::MakeRoomForWrite(), we do "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time. Test Plan: make all check Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files. Reviewers: haobo, kailiu, igor Reviewed By: kailiu CC: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D15141	2014-01-14 11:23:02 -08:00
Siying Dong	51dd21926c	DB::Put() to estimate write batch data size needed and pre-allocate buffer Summary: In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand. Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced. Test Plan: make all check Reviewers: haobo, kailiu, igor Reviewed By: haobo CC: leveldb, nkg- Differential Revision: https://reviews.facebook.net/D15135	2014-01-14 10:53:16 -08:00
Naman Gupta	8454cfe569	Add read/modify/write functionality to Put() api Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally. Test Plan: fbmake. Added unit tests. All unit tests pass. Reviewers: dhruba, haobo Reviewed By: haobo CC: sdong, kailiu, xinyaohu, sumeet, leveldb Differential Revision: https://reviews.facebook.net/D14745	2014-01-14 07:55:16 -08:00
Igor Canadi	a107691711	[column families] Implement refcounting ColumnFamilyData Summary: We don't want to delete ColumnFamilyData object if somebody has references to it. Test Plan: `make check` for now, but will need to implement bigger column family test case Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15111	2014-01-13 09:22:44 -08:00
Igor Canadi	151f9e144f	Merge branch 'master' into columnfamilies	2014-01-13 09:09:51 -08:00
Igor Canadi	d076cef347	[column families] Get rid of VersionSet::current_ and keep current Version for each column family Summary: The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family. I have also fixed some smaller things in VersionSet that made it easier to implement Column family support. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15105	2014-01-13 08:59:15 -08:00
Igor Canadi	dd6ecdf342	Use ASSERT_EQ() instead of assert() in merge_test	2014-01-11 09:25:47 -08:00
Schalk-Willem Kruger	a09ee1069d	Improve RocksDB "get" performance by computing merge result in memtable Summary: Added an option (max_successive_merges) that can be used to specify the maximum number of successive merge operations on a key in the memtable. This can be used to improve performance of the "get" operation. If many successive merge operations are performed on a key, the performance of "get" operations on the key deteriorates, as the value has to be computed for each "get" operation by applying all the successive merge operations. FB Task ID: #3428853 Test Plan: make all check db_bench --benchmarks=readrandommergerandom counter_stress_test Reviewers: haobo, vamsi, dhruba, sdong Reviewed By: haobo CC: zshao Differential Revision: https://reviews.facebook.net/D14991	2014-01-10 17:33:56 -08:00
Siying Dong	aa0ef6602d	[Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator() Summary: In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator(). We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency. Test Plan: add a test case in db_test Reviewers: haobo, kailiu Reviewed By: haobo CC: dhruba, igor, leveldb Differential Revision: https://reviews.facebook.net/D15039	2014-01-10 15:57:49 -08:00
Siying Dong	5b5ab0c1a8	[Performance Branch] Fix memory leak in HashLinkListRep.GetIterator() Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it. Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed Reviewers: kailiu, haobo Reviewed By: kailiu CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15093	2014-01-10 12:12:28 -08:00
Siying Dong	237a3da677	StopWatch not to get time if it is created for statistics and it is disabled Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case. Test Plan: make all check Reviewers: dhruba, haobo, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14703 Conflicts: db/db_impl.cc	2014-01-09 17:39:48 -08:00
Siying Dong	424a524ac9	[Performance Branch] A Hashed Linked List Based Mem Table Summary: Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list. The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket. Test Plan: Add a test case in db_test Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14979	2014-01-09 16:19:11 -08:00
Siying Dong	5575316350	StopWatch not to get time if it is created for statistics and it is disabled Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case. Test Plan: make all check Reviewers: dhruba, haobo, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14703	2014-01-08 16:05:36 -08:00
Igor Canadi	19e3ee64ac	Add column family information to WAL Summary: I have added three new value types: * kTypeColumnFamilyDeletion * kTypeColumnFamilyValue * kTypeColumnFamilyMerge which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet). This endeavour required changing some WriteBatch internals. Test Plan: Added a unittest Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15045	2014-01-08 12:53:33 -08:00
Mark Callaghan	50994bf699	Don't always compress L0 files written by memtable flush Summary: Code was always compressing L0 files written by a memtable flush when compression was enabled. Now this is done when min_level_to_compress=0 for leveled compaction and when universal_compaction_size_percent=-1 for universal compaction. Task ID: #3416472 Blame Rev: Test Plan: ran db_bench with compression options Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba, igor, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14757	2014-01-07 21:50:26 -08:00
Igor Canadi	a45b7d83ba	Merge pull request #59 from mlin/more-c-bindings C API: add rocksdb_env_set_high_priority_background_threads	2014-01-07 16:33:03 -08:00
Igor Canadi	72918efffe	[column families] Implement DB::OpenWithColumnFamilies() Summary: In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes: * Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings * Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&) * Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily() I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed. Test Plan: Added a simple unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15033	2014-01-07 11:05:50 -08:00
Igor Canadi	d3a2ba9c64	Merge branch 'master' into columnfamilies	2014-01-07 11:05:03 -08:00
Igor Canadi	17a222670b	Merge branch 'master' into performance	2014-01-07 11:04:21 -08:00
Tomislav Novak	9f690ec62c	Fix a deadlock in CompactRange() Summary: The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions can cause it to deadlock when CompactRange() is called concurrently from multiple threads. Imagine a following scenario with only two threads (max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0): 1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction (bg_compaction_scheduled_ is now 10) and waits for it to complete. 2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_ (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to drop to LargeNumber. 3. BG thread completes the first manual compaction, decrements bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_. Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue, thread #2 also goes to sleep (waiting on bg_cv_), without resetting bg_compaction_scheduled_. This diff attempts to address the problem by introducing a new counter bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only schedule manual compactions). Test Plan: I could pretty much consistently reproduce the deadlock with a program that calls CompactRange(nullptr, nullptr) immediately after Write() from multiple threads. This no longer happens with this patch. Tests (make check) pass. Reviewers: dhruba, igor, sdong, haobo Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14799	2014-01-07 10:37:34 -08:00
Igor Canadi	fff5c7e817	Merge branch 'master' into columnfamilies	2014-01-06 13:31:41 -08:00
Kai Liu	5e7d5629c7	Fix the valgrind issues	2014-01-03 11:48:31 -08:00
Igor Canadi	ef6ad1708d	[column families] Support to create and drop column families Summary: This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733 It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly. Test Plan: Added unit test Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15021	2014-01-03 01:12:16 -08:00
Kai Liu	774ed89c24	Replace vector with autovector Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life. Test Plan: make check I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change. Reviewers: dhruba, haobo, sdong, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14985	2014-01-02 16:43:35 -08:00
kailiu	e72aa37cc5	Merge branch 'master' into performance Conflicts: db/table_cache.cc	2014-01-02 16:34:59 -08:00
kailiu	476416c27c	Some minor refactoring on the code Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features. Test Plan: ran `make check` Reviewers: haobo, dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15009	2014-01-02 16:32:31 -08:00
kailiu	9281a826f1	Hotfix the bug in table cache's GetSliceForFileNumber Forgot to fix this problem in master branch. Already fixed it in performance branch.	2014-01-02 10:30:42 -08:00
Igor Canadi	7535443083	[RocksDB] Support for column families in manifest Summary: <This diff is for Column Family branch> Added fields in manifest file to support adding and deleting column families. Pretty simple change, each version edit record can be: 1. add column family 2. drop column family 3. add and delete N files from a single column family (compactions and flushes will generate such records) Test Plan: make check works, the code is backward compatible Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14733	2014-01-02 04:18:28 -08:00
Igor Canadi	6de1b5b83e	Merge branch 'master' into columnfamilies	2014-01-02 04:18:07 -08:00
Igor Canadi	b60c14f6ee	Support multi-threaded DisableFileDeletions() and EnableFileDeletions() Summary: We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions(). However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now. Test Plan: make check Reviewers: dhruba, haobo, sanketh Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14781	2014-01-02 03:33:42 -08:00
Mike Lin	4b1d049236	C API: add rocksdb_env_set_high_priority_background_threads	2013-12-31 15:14:18 -08:00
kailiu	f1cec73a76	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h	2013-12-27 12:23:17 -08:00
Siying Dong	a094f3b3b5	TableCache.FindTable() to avoid the mem copy of file number Summary: I'm not sure what's the purpose of encoding file number to a new buffer for looking up the table cache. It seems to be unnecessary to me. With this patch, we point the lookup key to the address of the int64 of the file number. Test Plan: make all check Reviewers: dhruba, haobo, igor, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14811	2013-12-26 16:57:07 -08:00
Siying Dong	18df47b79a	Avoid malloc in NotFound key status if no message is given. Summary: In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case. The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary. Test Plan: make all check Reviewers: dhruba, haobo, igor Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14691	2013-12-26 16:23:10 -08:00
kailiu	079a21ba99	Fix the unused variable warning message in mac os	2013-12-26 15:12:30 -08:00
Haobo Xu	bf4a48ccb3	[RocksDB] [Performance Branch] Revert previous patch. Summary: The previous patch is wrong. rep_.resize(kHeader) just resets the header portion to zero, and should not cause a re-allocation if g++ does it right. I will go ahead and revert it. Test Plan: make check Reviewers: dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D14793	2013-12-20 18:20:06 -08:00
Haobo Xu	e94eea4527	[RocksDB] [Performance Branch] Minor fix, Remove string resize from WriteBatch::Clear Summary: tmp_batch_ will get re-allocated for every merged write batch because of the existing resize in WriteBatch::Clear. Note that in DBImpl::BuildBatchGroup, we have a hard coded upper limit of batch size 1<<20 = 1MB already. Test Plan: make check Reviewers: dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D14787	2013-12-20 16:29:05 -08:00
Siying Dong	abaf26266d	[RocksDB] [Performance Branch] Some Changes to PlainTable format Summary: Some changes to PlainTable format: (1) support variable key length (2) use user defined slice transformer to extract prefixes (3) Run some test cases against PlainTable in db_test and table_test Test Plan: test db_test Reviewers: haobo, kailiu CC: dhruba, igor, leveldb, nkg- Differential Revision: https://reviews.facebook.net/D14457	2013-12-20 12:08:35 -08:00
Igor Canadi	1fdb3f7dc6	[RocksDB] Optimize locking for Get Summary: Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB. Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers: overwrite 56345 -> 63001 fillseq 193730 -> 185296 readrandom 771301 -> 1219803 (58% improvement!) readrandom_smallblockcache 677609 -> 862850 readrandom_memtable_sst 710440 -> 1109223 readrandom_fillunique_random 221589 -> 247869 memtablefillrandom 105286 -> 92643 memtablereadrandom 763033 -> 1288862 Test Plan: make asan_check I am also running db_stress Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14679	2013-12-20 09:57:58 -08:00
Mark Callaghan	ca92068b12	Add 'readtocache' test Summary: For some tests I want to cache the database prior to running other tests on the same invocation of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests and it will still do a full read of --num rows with one thread. It might be invoked like: db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8 Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14739	2013-12-18 16:54:53 -08:00
Igor Canadi	269709a885	Merge branch 'master' into columnfamilies	2013-12-18 15:56:37 -08:00
Igor Canadi	3b50b6213d	Merge pull request #37 from mlin/more-c-bindings C bindings: add a bunch of the newer options	2013-12-18 13:12:04 -08:00
Igor Canadi	9385a5247e	[RocksDB] [Column Family] Interface proposal Summary: <This diff is for Column Family branch> Sharing some of the work I've done so far. This diff compiles and passes the tests. The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all. Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility. There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on] Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families. Please provide feedback. Test Plan: make check works, the code is backward compatible Reviewers: dhruba, haobo, sdong, kailiu, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D14445	2013-12-18 13:08:22 -08:00
Siying Dong	14995a8ff3	Move level0 sorting logic from Version::SaveTo() to Version::Finalize() Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex. Test Plan: make all check Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14607	2013-12-17 18:06:58 -08:00
Siying Dong	a8b8b11dc4	Get() Does Not Reserve space for to_delete memtables Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around. Test Plan: make all check Reviewers: dhruba, haobo, igor Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14697	2013-12-17 17:16:16 -08:00
Mike Lin	2a2506b629	C bindings: add a bunch of the newer options	2013-12-15 13:47:06 -08:00
Kai Liu	2e9efcd6d8	Add the property block for the plain table Summary: This is the last diff that adds the property block to plain table. The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format [data block] [meta block 1: stats block] [meta block 2: future extended block] ... [meta block K: future extended block] (we may add more meta blocks in the future) [metaindex block] [index block: we only have the placeholder here, we can add persistent index block in the future] [Footer: contains magic number, handle to metaindex block and index block] <end_of_file> Test Plan: extended existing property block test. Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14523	2013-12-13 17:18:14 -08:00
kailiu	0cd1521af5	Completely remove argv_ since no one use it There are still warning in some other environment, just move that useless variable `argv_`	2013-12-12 16:36:38 -08:00
kailiu	0e24f97b9f	Revert last commit and add "unused" attribute to suppress warning	2013-12-12 15:40:44 -08:00
kailiu	bc9b488e92	fix a warning in db_test when running `make release`	2013-12-12 15:35:02 -08:00
Mark Callaghan	e9e6b00d29	Add monitoring for universal compaction and add counters for compaction IO Summary: Adds these counters { WAL_FILE_SYNCED, "rocksdb.wal.synced" } number of writes that request a WAL sync { WAL_FILE_BYTES, "rocksdb.wal.bytes" }, number of bytes written to the WAL { WRITE_DONE_BY_SELF, "rocksdb.write.self" }, number of writes processed by the calling thread { WRITE_DONE_BY_OTHER, "rocksdb.write.other" }, number of writes not processed by the calling thread. Instead these were processed by the current holder of the write lock { WRITE_WITH_WAL, "rocksdb.write.wal" }, number of writes that request WAL logging { COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" }, number of bytes read during compaction { COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" }, number of bytes written during compaction Per-interval stats output was updated with WAL stats and correct stats for universal compaction including a correct value for write-amplification. It now looks like: Compactions Level Files Size(MB) Score Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count Ln-stall Stall-cnt -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 7 464 46.4 281 3411 3875 3411 0 3875 2.1 12.1 13.8 621 0 240 240 628 0.0 0 Uptime(secs): 310.8 total, 2.0 interval Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write Amplification cumulative: 4.1 write, 6.8 compaction Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write Amplification interval: 101.7 write, 102.9 compaction Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown Task ID: #3329644, #3301695 Blame Rev: Test Plan: Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14583	2013-12-12 13:27:43 -08:00
Siying Dong	e8ab1934d9	[RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get. Test Plan: make all check will run db_stress for a while too to make sure no problem. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14589	2013-12-12 11:30:00 -08:00
Siying Dong	aaf9c6203c	[RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version. Test Plan: make all check Reviewers: haobo, igor, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14595	2013-12-11 13:59:43 -08:00
Siying Dong	a8029fdc75	Introduce MergeContext to Lazily Initialize merge operand list Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14415 Conflicts: db/db_impl.cc db/memtable.cc	2013-12-11 11:37:28 -08:00
Siying Dong	bc5dd19b14	[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo() Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get(). Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: dhruba CC: nkg-, igor, leveldb Differential Revision: https://reviews.facebook.net/D14409	2013-12-11 10:50:09 -08:00
Siying Dong	41349d9ef1	[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo() Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get(). Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: dhruba CC: nkg-, igor, leveldb Differential Revision: https://reviews.facebook.net/D14409	2013-12-11 10:49:49 -08:00
Siying Dong	0304e3d2ff	When flushing mem tables, create iterators out of mutex Summary: creating new iterators of mem tables can be expensive. Move them out of mutex. DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: dhruba CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14577 Conflicts: db/db_impl.cc	2013-12-11 10:02:17 -08:00
Siying Dong	95a411d853	When flushing mem tables, create iterators out of mutex Summary: creating new iterators of mem tables can be expensive. Move them out of mutex. DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: dhruba CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14577	2013-12-11 09:57:19 -08:00
Haobo Xu	3c02c363b3	[RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering Summary: as title Test Plan: dynamic_bloom_test Reviewers: dhruba, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14385	2013-12-11 00:15:14 -08:00
kailiu	a82f42b765	rename db/memtablelist.{h,cc}	2013-12-10 19:03:13 -08:00
Igor Canadi	204bb9cffd	Get rid of LogFlush() in InternalIterator	2013-12-10 10:59:00 -08:00
Igor Canadi	19f5463d3f	Don't LogFlush() in foreground threads Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads. Test Plan: ./db_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14535	2013-12-10 10:57:46 -08:00
Igor Canadi	a204dabb9d	Merge pull request #31 from sepeth/c-api Rename leveldb to rocksdb in C api	2013-12-10 09:18:47 -08:00
Doğan Çeçen	6c4e110c8c	Rename leveldb to rocksdb in C api	2013-12-10 10:48:35 +02:00
Igor Canadi	fb9fce4fc3	[RocksDB] BackupableDB Summary: In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you. Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes. There are multiple things you can configure: 1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like. 2. sync - if true, it guarantees backup consistency on machine reboot 3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. IMPORTANT -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files. 4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup. 5. More things you can find in BackupableDBOptions Here is the directory structure I use: backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot 0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files All the files are ref counted and deleted immediatelly when they get out of scope. Some other stuff in this diff: 1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do. 2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB. Test Plan: I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff. Also, `make asan_check` Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14295	2013-12-09 14:06:52 -08:00
kailiu	551e9428ce	Merge branch 'master' into performance	2013-12-06 14:15:42 -08:00
Siying Dong	ef2211a9ca	[RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14415	2013-12-06 10:28:59 -08:00
kailiu	b1d2de4a40	Fix #26 by putting the implementation of CreateDBStatistics() to a cc file	2013-12-05 22:29:03 -08:00
kailiu	90729f8b23	Extract metaindex block from block-based table Summary: This change will allow other table to reuse the code for meta blocks. Test Plan: all existing unit tests passed Reviewers: dhruba, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D14475	2013-12-05 16:34:16 -08:00
Mayank Agarwal	92e8316118	Make GetDbIdentity pure virtual and also implement it for StackableDB, DBWithTTL Summary: As title Test Plan: make clean and make Reviewers: igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14469	2013-12-05 12:02:31 -08:00
Mayank Agarwal	18802689b8	Make an API to get database identity from the IDENTITY file Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file) Test Plan: db/db_test (has identity checks) Reviewers: dhruba, haobo, igor, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14463	2013-12-04 22:39:17 -08:00
Mark Callaghan	97aa401e2f	Add compression options to db_bench Summary: This adds 2 options for compression to db_bench: * universal_compression_size_percent * compression_level - to set zlib compression level It also logs compression_size_percent at startup in LOG Task ID: # Blame Rev: Test Plan: make check, run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14439	2013-12-03 14:28:48 -08:00
Sajal Jain	28a1b9b95f	[rocksdb] statistics counters for memtable hits and misses Summary: added counters rocksdb.memtable.hit - for memtable hit rocksdb.memtable.miss - for memtable miss Test Plan: db_bench tests Reviewers: igor, dhruba, haobo Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D14433	2013-12-03 12:59:53 -08:00
Igor Canadi	eb12e47e0e	Killing Transform Rep Summary: Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around. This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release. I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14397	2013-12-03 12:42:15 -08:00
Igor Canadi	043fc14c3e	Get rid of some shared_ptrs Summary: I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins. The ones that are left are accessed infrequently and I think we're fine with keeping them. Test Plan: make asan_check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14427	2013-12-03 11:17:58 -08:00
Dhruba Borthakur	96bc3ec297	Memtables should be deleted appropriately in the unit test. Summary: Memtables should be deleted appropriately in the unit test. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-12-01 21:23:44 -08:00
Dhruba Borthakur	98968ba937	Free obsolete memtables outside the dbmutex had a memory leak. Summary: The commit at `27bbef1180` had a memory leak that was detected by valgrind. The memtable that has a refcount decrement in MemTableList::InstallMemtableFlushResults was not freed. Test Plan: valgrind ./db_test --leak-check=full Reviewers: igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14391	2013-11-28 10:25:22 -08:00
Igor Canadi	35ddf18367	Don't do compression tests if we don't have compression libs Summary: These tests fail if compression libraries are not installed. Test Plan: Manually disabled snappy, observed tests not ran. Reviewers: dhruba, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14379	2013-11-27 13:32:56 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
Haobo Xu	4e6463ea44	[RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket. Test Plan: make check and test it on leaf4 Reviewers: dhruba, sdong, kailiu CC: reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14307	2013-11-26 21:59:36 -08:00
Dhruba Borthakur	8478f380a0	During benchmarking, I see excessive use of vector.reserve(). Summary: This code path can potentially accumulate multiple important_files for level 0. But for other levels, it should have only one file in the important_files, so it is ok not to reserve excessive space, is it not? Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14349	2013-11-26 07:47:08 -08:00
Dhruba Borthakur	27bbef1180	Free obsolete memtables outside the dbmutex. Summary: Large memory allocations and frees are costly and best done outside the db-mutex. The memtables are already allocated outside the db-mutex but they were being freed while holding the db-mutex. This patch frees obsolete memtables outside the db-mutex. Test Plan: make check db_stress Unit tests pass, I am in the process of running stress tests. Reviewers: haobo, igor, emayanke Reviewed By: haobo CC: reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14319	2013-11-25 21:04:48 -08:00
Igor Canadi	3ce3658411	DB::GetOptions() Summary: We need access to options for BackupableDB Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14331	2013-11-25 15:51:50 -08:00
Igor Canadi	11c26bd4a4	[RocksDB] Interface changes required for BackupableDB Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review Test Plan: make asan_check Reviewers: dhruba, haobo, emayanke Reviewed By: emayanke CC: leveldb, kailiu, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14301	2013-11-25 12:39:23 -08:00
Dhruba Borthakur	299f5c76bb	Create new log file outside the dbmutex. Summary: All filesystem Io should be done outside the dbmutex. There was one place when we have to roll the transaction log that we were creating the new log file while holding the dbmutex. I rearranged this code so that the act of creating the new transaction log file is done without holding the dbmutex. I also allocate the new memtable outside the dbmutex, this is important because creating the memtable could be heavyweight. Test Plan: make check and dbstress Reviewers: haobo, igor Reviewed By: haobo CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14283	2013-11-25 11:23:42 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	718488abc5	Add BloomFilter to PlainTableIterator::Seek() Summary: This patch adds a simple bloom filter in PlainTableIterator::Seek() Test Plan: N/A Reviewers: CC: Task ID: # Blame Rev:	2013-11-21 22:26:39 -08:00
kailiu	0c93df912e	Improve the readability of the TableProperties::ToString()	2013-11-21 17:54:23 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit `3d8ac31d71`.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00
Siying Dong	3d8ac31d71	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001	2013-11-21 16:29:57 -08:00
Siying Dong	58e1956d50	[Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables. Summary: For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization. In longer term, we might want to revisit the way we abstract mem tables implementations. Test Plan: make all check Reviewers: haobo, igor, kailiu Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14265	2013-11-20 20:49:23 -08:00
Siying Dong	b59d4d5a50	A Simple Plain Table Summary: A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes. Test Plan:Add unit test Reviewers:haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 18:44:22 -08:00
Siying Dong	071fb0d77b	Inline a couple of functions and put one save lazily clearing Summary: Machine several functions inline. Also, in DBIter.Seek() make value cleaning up lazily done. These are for the use case that Seek() are called lots of times but few return values. Test Plan: make all check Differential Revision: https://reviews.facebook.net/D14217	2013-11-20 17:32:57 -08:00
Haobo Xu	37b459f0aa	[RocksDB] Test diff on performance branch Summary: trivia comment change Test Plan: Go through the step ofs developing under the performance branch Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D14259	2013-11-20 14:34:52 -08:00
Haobo Xu	a617227a36	[RocksDB] fix prefix_test Summary: user comparator needs to work if either input is prefix only. Test Plan: ./prefix_test --write_buffer_size=100000 --total_prefixes=10000 --items_per_prefix=10 Reviewers: dhruba, igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14241	2013-11-20 09:16:23 -08:00
kailiu	6eb5649800	Move flush_block_policy from Options to TableFactory Summary: Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables. It will make more sense to move it to block based table's own factory class. Test Plan: make check to pass existing tests Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14211	2013-11-19 22:00:48 -08:00
kailiu	1415f8820d	Improve the "table stats" Summary: The primary motivation of the changes is to make it easier to figure out the inside of the tables. * rename "table stats" to "table properties" since now we have more than "integers" to store in the property block. * Add filter block size to the basic table properties. * Whenever a table is built, we'll log the table properties (the sample output is in Test Plan). * Make an api to expose deleted keys. Test Plan: Passed all existing test. and the sample output of table stats: ================================================================== Basic Properties ------------------------------------------------------------------ # data blocks: 1 # entries: 1 raw key size: 9 raw average key size: 9 raw value size: 9 raw average value size: 0 data block size: 25 index block size: 27 filter block size: 18 (estimated) table size: 70 filter policy: rocksdb.BuiltinBloomFilter ================================================================== User collected properties: InternalKeyPropertiesCollector ------------------------------------------------------------------ kDeletedKeys: 1 ================================================================== Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14187	2013-11-19 16:29:42 -08:00
Igor Canadi	fc61428288	Include <unistd.h> in db_test Summary: This is the only compile issue in Ubuntu. It might be better to include <unistd.h> only in env_posix and add Truncate function to Env, but since we use truncate only in db_test, I don't think it makes much sense. Test Plan: Rocksdb now compiles on Ubuntu! Reviewers: dhruba, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14127	2013-11-17 21:58:16 -08:00
Igor Canadi	de9ce7d439	Upgrading compiler to gcc4.8.1 Summary: Finally did it - the trick was in using --dynamic-linker option. This is first step to running ASAN. All of our code seems to compile just fine on 4.8.1. However, I still left fbcode.471.sh in the 'build_tools/' just in case. Test Plan: make clean; make Reviewers: dhruba, haobo, kailiu, emayanke, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14109	2013-11-17 13:52:55 -08:00
Kai Liu	75df72f2a5	Change the logic in KeyMayExist() Summary: Previously in KeyMayExist(), if DB::Get() returns non-Status::OK(), we assumes key may not exist. However, as if index block is not in block cache, Status::Incomplete() will return. Worse still, if options::filter_delete is enabled, we may falsely ignore the "delete" operation: https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L217-L220 This diff fixes this bug and will let crash-test pass. Test Plan: Ran: ./db_stress --test_batches_snapshots=1 --ops_per_thread=1000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=1 --reopen=0 --readpercent=5 --prefixpercent=45 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/home/kailiu/local/newer --max_key=100000000 --disable_seek_compaction=0 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=1 Previously we'll see crash happens very soon. Reviewers: igor, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14115	2013-11-17 01:00:34 -08:00
kailiu	97d8e573a6	make util/env_posix.cc work under mac Summary: This diff invoves some more complicated issues in the posix environment. Test Plan: works under mac os. will need to verify dev box. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14061	2013-11-16 23:44:39 -08:00
Pascal Borreli	443e04e62d	Fixed typos	2013-11-16 11:21:34 +00:00
Igor Canadi	21905dd4a8	Start DeleteFileTest with clean plate Summary: Remove all the files from the test dir before the test. The test failed when there were some old files still in the directory, since it checks the file counts. This is what caused jenkins' test failures. It was running fine on my machine so it was hard to repro. Test Plan: 1. create an extra 000001.log file in the test directory 2. run a ./deletefile_test - test failes 3. patch ./deletefile_test with this 4. test succeeds Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14097	2013-11-15 16:30:23 -08:00
Igor Canadi	29c931f70b	Avoid populating live set if we don't need to Summary: Also changed some comments Test Plan: ./deletefile_test Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14091	2013-11-14 22:42:02 -08:00
Igor Canadi	a0ce3fd00a	PurgeObsoleteFiles() unittest Summary: Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly. Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions. Test Plan: Created a unit test Reviewers: dhruba, haobo, nkg- Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14079	2013-11-14 18:03:57 -08:00
Vamsi Ponnekanti	94dde686bb	[Merge operand meant for key K is being applied on wrong key] Summary: We iterate until we find a different key than original key. ikey is pointing to next key when we break out of loop. After the loop we apply all merge operands meant for original key on the next key! Test Plan: Need to give a build to Marcin to test out. Revert Plan: OK Task ID: #3181932 Reviewers: haobo, emayanke, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14073	2013-11-14 17:13:24 -08:00
Igor Canadi	fda8142f29	Delete log files in the correct dir Summary: Log files are stored in wal_dir, not dbname_ Test Plan: deletfile_test Reviewers: nkg- Reviewed By: nkg- CC: leveldb Differential Revision: https://reviews.facebook.net/D14067	2013-11-13 14:54:54 -08:00
Kai Liu	88ba331c1a	Add the index/filter block cache Summary: This diff leverage the existing block cache and extend it to cache index/filter block. Test Plan: Added new tests in db_test and table_test The correctness is checked by: 1. make check 2. make valgrind_check Performance is test by: 1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance). 2. db_stress. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo, xjin Differential Revision: https://reviews.facebook.net/D13167	2013-11-12 22:46:51 -08:00
Kai Liu	35460ccb53	Fix the string format issue Summary: mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable. Test Plan: make clean && make -j32	2013-11-12 21:05:39 -08:00
Igor Canadi	d88d8ecf80	Fix deleting files Summary: One more fix! In some cases, our filenames start with "/". Apparently, env_ can't handle filenames with double // Test Plan: deletefile_test does not include this line in the LOG anymore: 2013/11/12-18:11:43.150149 7fe4a6fff700 RenameFile logfile #3 FAILED -- IO error: /tmp/rocksdbtest-3574/deletefile_test//000003.log: No such file or directory Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14055	2013-11-12 20:32:07 -08:00
kailiu	21587760b9	Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background. Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os.. Test Plan: ran make in mac os Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14049	2013-11-12 20:05:28 -08:00
Igor Canadi	9df2b217e9	Move fast and break things Summary: Broke the compile when I removed purge_log_after_memtable_flush. sorrybus Test Plan: make db_bench works now Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14037	2013-11-12 12:42:42 -08:00
Igor Canadi	9bc4a26f56	Small changes in Deleting obsolete files Summary: @haobo's suggestions from https://reviews.facebook.net/D13827 Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop. I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though. Test Plan: Ran deletefile_test Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14025	2013-11-12 11:53:26 -08:00
Igor Canadi	dad425562f	Move the comment Summary: Moving the comment per @haobo suggestion. Test Plan: No Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14019	2013-11-12 10:07:55 -08:00
Igor Canadi	4abd219cfc	Combine two FindObsoleteFiles() Summary: We don't need to call FindObsoleteFiles() twice Test Plan: deletefile_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14007	2013-11-11 21:41:32 -08:00
Igor Canadi	94e139f94d	Fixing failed delete file test Summary: FindObsoleteFiles() has to be called before PurgeObsoleteFiles() because FindObsoleteFiles() sets manifest_file_number, log_number and prev_log_number to valid values. Test Plan: deletefile_test now works Reviewers: dhruba, emayanke, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13995	2013-11-11 21:03:41 -08:00
Dhruba Borthakur	318a4919d2	Fix valgrind check by initialising DeletionState. Summary: The valgrind error was introduced by commit `1510339e52`. Initialize DeletionState in constructor. Test Plan: valgrind --leak-check=yes ./deletefile_test Reviewers: igor, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13983	2013-11-11 16:01:13 -08:00
lovro	8a46ecd357	WriteBatch::Put() overload that gathers key and value from arrays of slices Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload. It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value. Since RocksDB makes a copy internally, it's easy to support gather writes. Test Plan: write_batch_test, new test case Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13947	2013-11-08 16:34:32 -08:00
Igor Canadi	1510339e52	Speed up FindObsoleteFiles Summary: Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file. It might speed things up a bit, but makes code uglier. I don't really like it. Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though. Let's discuss offline today. Test Plan: Ran ./db_stress, verified that files are getting deleted Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13827	2013-11-08 15:23:46 -08:00
Igor Canadi	dd218bbc88	Forgot to change interface everywhere Summary: Changed the name and interface for creating HashSkipListRep. Forgot to change it in db_test. Test Plan: make db_test Reviewers: haobo Reviewed By: haobo Differential Revision: https://reviews.facebook.net/D13965	2013-11-08 12:23:12 -08:00
Igor Canadi	8b3379dc0a	Implementing DynamicIterator for TransformRepNoLock Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations. Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again! Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D13953	2013-11-08 00:31:09 -08:00
Kai Liu	fd075d6edd	Provide mechanism to configure when to flush the block Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block. Test Plan: make check Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D13875	2013-11-07 21:27:21 -08:00
Kai Liu	bba6595b1f	Fix the valgrind error Summary: I this bug from valgrind report and found a place that may potentially leak memory. Test Plan: re-ran the valgrind and no error any more Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13959	2013-11-07 15:46:48 -08:00
Igor Canadi	444cf88a56	Flush the log outside of lock Summary: Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held. We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem. Test Plan: db_test Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13935	2013-11-07 11:31:56 -08:00
Haobo Xu	fd2044883a	[RocksDB] Generalize prefix-aware iterator to be used for more than one Seek Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too. Test Plan: test it on Leaf Reviewers: dhruba, kailiu, sdong, igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D13929	2013-11-06 20:45:49 -08:00
shamdor	c2be2cba04	WAL log retention policy based on archive size. Summary: Archive cleaning will still happen every WAL_ttl seconds but archived logs will be deleted only if archive size is greater then a WAL_size_limit value. Empty archived logs will be deleted evety WAL_ttl. Test Plan: 1. Unit tests pass. 2. Benchmark. Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13869	2013-11-06 18:46:28 -08:00
Igor Canadi	be96f2498e	TransformRep - use array instead of unordered_map Summary: I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one. Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks. I will run the new benchmark and post the results here. Test Plan: db_test Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13893	2013-11-06 11:55:43 -08:00
Haobo Xu	fe4a449472	[RocksDB] prefixhash memtable test Summary: as title, half baked test for prefixhash memtable. Also contains deadlock test option Test Plan: run it Reviewers: igor, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D13887	2013-11-05 23:20:10 -08:00
Dhruba Borthakur	39190588e1	Fix failure in rocksdb unit test CompressedCache Summary: The problem was that there was only a single key-value in a block and its compressibility was less than 88%. Rocksdb refuses to compress a block unless its compresses to lesser than 88% of its original size. If a block is not compressed, it does nto get inserted into the compressed block cache. Create the test data so that multiple records fit into the same data block. This increases the compressibility of these data block. Test Plan: ./db_test Reviewers: kailiu, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13905	2013-11-05 16:11:34 -08:00
Dhruba Borthakur	7845fd9db9	Fixed valgrind error in DBTest.CompressedCache Summary: Fixed valgrind error in DBTest.CompressedCache. This fixes the valgrind error (thanks to Haobo). I am still trying to reproduce the test-failure case deterministically. Test Plan: db_test Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13899	2013-11-05 11:12:39 -08:00
Mayank Agarwal	f837f5b1c9	Making the transaction log iterator more robust Summary: strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application. Set NotFoudn::gaps status to give an indication of gaps happeneing. Polished the inline documentation at various places Test Plan: * db_repl_stress test * db_test relating to transaction log iterator * fbcode/wormhole/rocksdb/rocks_log_iterator * sigma production machine sigmafio032.prn1 Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13689	2013-11-04 20:49:03 -08:00
Dhruba Borthakur	b4ad5e89ae	Implement a compressed block cache. Summary: Rocksdb can now support a uncompressed block cache, or a compressed block cache or both. Lookups first look for a block in the uncompressed cache, if it is not found only then it is looked up in the compressed cache. If it is found in the compressed cache, then it is uncompressed and inserted into the uncompressed cache. It is possible that the same block resides in the compressed cache as well as the uncompressed cache at the same time. Both caches have their own individual LRU policy. Test Plan: Unit test case attached. Reviewers: kailiu, sdong, haobo, leveldb Reviewed By: haobo CC: xjin, haobo Differential Revision: https://reviews.facebook.net/D12675	2013-11-01 14:31:35 -07:00
Igor Canadi	beeb74be6f	Move I/O outside of lock Summary: I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this. It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time). Test Plan: make check (currently running) Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13839	2013-11-01 12:32:27 -07:00
Haobo Xu	8cbe5bb56b	[RocksDB] Add OnCompactionStart to CompactionFilter class Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently. Test Plan: make check Reviewers: dhruba, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13683	2013-10-31 13:36:43 -07:00
Naman Gupta	b4fab3be2a	Merge branch 'master' of github.com:facebook/rocksdb into inplace	2013-10-31 11:51:03 -07:00
Naman Gupta	fe25070242	In-place updates for equal keys and similar sized values Summary: Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case: 1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot 2. Latest value type is a 'put' ie kTypeValue 3. New value size is less than existing value, to avoid reallocating memory TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge. TODO: Update the transaction log, to allow consistent reload of the memtable. Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next. Reviewers: xinyaohu, sumeet, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12423 Automatic commit by arc	2013-10-31 11:27:12 -07:00
Siying Dong	f03b2df010	Follow-up Cleaning-up After D13521 Summary: This patch is to address @haobo's comments on D13521: 1. rename Table to be TableReader and make its factory function to be GetTableReader 2. move the compression type selection logic out of TableBuilder but to compaction logic 3. more accurate comments 4. Move stat name constants into BlockBasedTable implementation. 5. remove some uncleaned codes in simple_table_db_test Test Plan: pass test suites. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13785	2013-10-30 10:52:33 -07:00
Siying Dong	068a819ac9	Fix valgrind_check problem of simple_table_db_test.cc Summary: Two memory issues caused valgrind_check to fail on simple_table_db_test. Fix it Test Plan: make -j32 valgrind_check Reviewers: kailiu, haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13773	2013-10-29 14:29:03 -07:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Kai Liu	994575c134	Support user-defined table stats collector Summary: 1. Added a new option that support user-defined table stats collection. 2. Added a deleted key stats collector in `utilities` Test Plan: Added a unit test for newly added code. Also ran make check to make sure other tests are not broken. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13491	2013-10-28 15:45:14 -07:00
Kai Liu	7e91b86f4d	Fix a valgrind warning Summary: A latest valgrind test found a recently added unit test has memory leak, which is because DB is not closed at the end of the test. Test Plan: re-run the valgrind locally and make sure there's no memory leakage any more. Reviewers: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13725	2013-10-28 14:34:27 -07:00
Igor Canadi	100fa8e013	If a Put fails, fail all other puts Summary: When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one. I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them? Test Plan: Corruption unit test. Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13671	2013-10-28 12:36:02 -07:00
Kai Liu	a1d38a41fd	fix the error message in debug mode Summary: my fix patch introduced a new error in debug mode. Test Plan: `make` and `make release`	2013-10-27 23:11:13 -07:00
Kai Liu	39c14891b6	Fix the gcc warning for unused variable Summary: Fix the unused variable warning for `first` when running `make release` Test Plan: make make check Reviewers: dhruba, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D13695	2013-10-27 22:57:45 -07:00
Mayank Agarwal	56305221c4	Unify DeleteFile and DeleteWalFiles Summary: This is to simplify rocksdb public APIs and improve the code quality. Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file. Wrote exhaustive unit-tests in delete_file_test Unification of other redundant APIs can be taken up in a separate diff Test Plan: Expanded delete_file test Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13647	2013-10-25 08:32:14 -07:00
Kai Liu	c17607a251	Fix the log number bug when updating MANIFEST file Summary: Crash may occur during the flushes of more than two mem tables. As the info log suggested, even when both were successfully flushed, the recovery process still pick up one of the memtable's log for recovery. This diff fix the problem by setting the correct "log number" in MANIFEST. Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type. Reviewers: haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13659	2013-10-24 21:05:33 -07:00
Slobodan Predolac	e44976b199	Conversion of db_bench, db_stress and db_repl_stress to use gflags Summary: Converted db_stress, db_repl_stress and db_bench to use gflags Test Plan: I tested by printing out all the flags from old and new versions. Tried defaults, + various combinations with "interesting flags". Also, tested by running db_crashtest.py and db_crashtest2.py. Reviewers: emayanke, dhruba, haobo, kailiu, sdong Reviewed By: emayanke CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D13581	2013-10-24 07:43:14 -07:00
Haobo Xu	2fb361ad98	[RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log. Summary: as title Test Plan: make check; ./perf_context_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13629	2013-10-23 13:38:39 -07:00
Mayank Agarwal	e56ce03691	Hardcoding temp file name for Identity file to 000000.dbtmp just like it's done for CURRENT file Summary: as per Dhruba's suggestion Test Plan: make all check; Seen the Id getting generated properly in db_repl_stress Reviewers: dhruba, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13635	2013-10-23 11:34:22 -07:00
Mayank Agarwal	9b50106f9a	Dbid feature Summary: Create a new type of file on startup if it doesn't already exist called DBID. This will store a unique number generated from boost library's uuid header file. The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery) the key point to note is that DBID is not stored in a backup or db snapshot It's preferable to use Boost for uuid because: 1) A non-standard way of generating uuid is not good 2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean 3) c++ doesn't have any direct way to get a uuid 4) Boost is a very good library that was already having linkage in rocksdb from third-party Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug. I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it. @kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify. Test Plan: Expand db_test to test 2 cases 1) Restarting db with Id file present - verify that no change to Id 2)Restarting db with Id file deleted - verify that a different Id is there after reopen Also run make all check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13587	2013-10-22 12:23:34 -07:00
Mayank Agarwal	ae8e0770b4	Disallow transaction log iterator to skip sequences Summary: This is expected to solve the "gaps in transaction log iterator" problem. * After a lot of observations on the gaps on the sigmafio machines I found that it is due to a race between log reader and writer always. * So when we drop the wormhole subscription and refresh the iterator, the gaps are not there. * It is NOT due to some boundary or corner case left unattended in the iterator logic because I checked many instances of the gaps against their log files with ldb. The log files are NOT corrupted also. * The solution is to not allow the iterator to read incompletely written sequences and detect gaps inside itself and invalidate it which will cause the application to refresh the iterator normally and seek to the required sequence properly. * Thus, the iterator can at least guarantee that it will not give any gaps. Test Plan: * db_test based log iterator tests * db_repl_stress * testing on sigmafio setup to see gaps go away Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13593	2013-10-22 11:45:35 -07:00
Siying Dong	65428b0c0a	Fix Bug: iterator.Prev() or iterator.SeekToLast() might return the first element instead of the correct one Summary: Recent patch https://reviews.facebook.net/D11865 introduced a regression bug: DBIter::FindPrevUserEntry(), which is called by DBIter::Prev() (and also implicitly if calling iterator.SeekToLast()) might do issue a seek when having skipped too many entries. If the skipped entry just before the seek() is a delete, the saved key is erased so that it seeks to the front, so Prev() would return the first element. This patch fixes the bug by not doing seek() in DBIter::FindNextUserEntry() if saved key has been erased. Test Plan: Add a test DBTest.IterPrevMaxSkip which would fail without the patch and would pass with the change. Reviewers: dhruba, xjin, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13557	2013-10-17 18:33:18 -07:00
Siying Dong	9edda37027	Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress Summary: This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases. Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later. Test Plan: add test DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12 Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13467	2013-10-17 13:33:39 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Siying Dong	073cbfc8f0	Enable background flush thread by default and fix issues related to it Summary: Enable background flush thread in this patch and fix unit tests with: (1) After background flush, schedule a background compaction if condition satisfied; (2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered (3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results. Test Plan: pass all unit tests Reviewers: haobo, xjin, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13461	2013-10-16 13:32:53 -07:00
Mayank Agarwal	da2fd001a6	Fix rocksdb->levledb BytewiseComparator and inverted order of error in db/version_set.cc Summary: This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb. The inverted order in the error message caused confusion prebiously Test Plan: make; open existing db Reviewers: leveldb, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13449	2013-10-14 18:16:54 -07:00
Mayank Agarwal	fe3713961e	Features in Transaction log iterator Summary: * Logstore requests a valid change of reutrning an empty iterator and not an error in case of no log files. * Changed the code to return the writebatch containing the sequence number requested from GetupdatesSince even if it lies in the middle. Earlier we used to return the next writebatch,. This also allows me oto guarantee that no files played upon by the iterator are redundant. I mean the starting log file has at least a sequence number >= the sequence number requested form GetupdatesSince. * Cleaned up redundant logic in Iterator::Next and made a new function SeekToStartSequence for greater readability and maintainibilty. * Modified a test in db_test accordingly Please check the logic carefully and suggest improvements. I have a separate patch out for more improvements like restricting reader to read till written sequences. Test Plan: * transaction log iterator tests in db_test, * db_repl_stress. * rocks_log_iterator_test in fbcode/wormhole/rocksdb/test - 2 tests thriving on hacks till now can get simplified * testing on the shadow setup for sigma with replication Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13437	2013-10-14 18:16:21 -07:00
Kai Liu	86ef6c3f74	Add statistics to sst file Summary: So far we only have key/value pairs as well as bloom filter stored in the sst file. It will be great if we are able to store more metadata about this table itself, for example, the entry size, bloom filter name, etc. This diff is the first step of this effort. It allows table to keep the basic statistics mentioned in http://fburl.com/14995441, as well as allowing writing user-collected stats to stats block. After this diff, we will figure out the interface of how to allow user to collect their interested statistics. Test Plan: 1. Added several unit tests. 2. Ran `make check` to ensure it doesn't break other tests. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13419	2013-10-14 15:56:13 -07:00
Siying Dong	88f2f89068	Change Function names from Compaction->Flush When they really mean Flush Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance. Test Plan: Run test suites. Reviewers: haobo, dhruba, xjin Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13431	2013-10-14 15:12:15 -07:00
sdong	f8509653ba	LRUCache to try to clean entries not referenced first. Summary: With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\ ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\ he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used. Test Plan: add two unit tests to cover the codes Reviewers: dhruba, haobo, emayanke Reviewed By: emayanke CC: leveldb, emayanke, xjin Differential Revision: https://reviews.facebook.net/D13377	2013-10-11 09:26:21 -07:00
Dhruba Borthakur	c0ce562c32	Bad nfs file checked in a long time back. Summary: Bad nfs file checked in a long time back. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-10-10 22:00:49 -07:00
Mayank Agarwal	a8b4a69de0	Fixing error in ParseFileName causing DestroyDB to fail on archive directory Summary: This careless error was causing ASSERT_OK(DestroyDB) to fail in db_test. Basically .. was being returned as a child of db/archive and ParseFileName returned false on that, but 'type' was set to LogFile from earlier and not reset. The return of ParseFileName was not being checked to delete the log file or not. Test Plan: make all check Reviewers: dhruba, haobo, xjin, kailiu, nkg- Reviewed By: nkg- CC: leveldb Differential Revision: https://reviews.facebook.net/D13413	2013-10-10 18:18:31 -07:00
Naman Gupta	cbf4a06427	Add option for storing transaction logs in a separate dir Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory. Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases. Reviewers: dhruba, haobo, leveldb CC: xinyaohu, sumeet Differential Revision: https://reviews.facebook.net/D13239	2013-10-08 17:40:27 -07:00
Naman Gupta	116071411b	Make db_test more robust Summary: While working on D13239, I noticed that the same options are not used for opening and destroying at db. So adding that. Also added asserts for successful DestroyDB calls. Test Plan: Ran unit tests. Atleast 1 unit test is failing. They failures are a result of some past logic change. I'm not really planning to fix those. But I would like to check this in. And hopefully the respective unit test owners can fix the broken tests Reviewers: leveldb, haobo CC: xinyaohu, sumeet, dhruba Differential Revision: https://reviews.facebook.net/D13329	2013-10-08 13:19:31 -07:00
Dhruba Borthakur	1a8c1b0817	Unit test failure in DBTest.NumImmutableMemTable. Summary: Previous patch introduced a unit test failure in DBTest.NumImmutableMemTable because of change in property names. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-10-06 01:12:02 -07:00
Dhruba Borthakur	4463b11cad	Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Test Plan: make check Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13311	2013-10-06 00:14:26 -07:00
Haobo Xu	bf89edf78b	[RocksDB] Added a property "leveldb.num-immutable-mem-table" so that Flush can be called without blocking, and application still has a way to check when it's done also without blocking. Summary: as title Test Plan: DBTest.NumImmutableMemTable Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13305	2013-10-05 11:54:08 -07:00
Dhruba Borthakur	0a9f873f4b	Removed scribe, thrift and java modules. Summary: Removed scribe, thrift and java modules. Test Plan: make release make check Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13293	2013-10-04 15:36:00 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Mayank Agarwal	b3ed08129b	Add a statistic to count the number of calls to GetUpdatesSince Summary: This is useful to keep track of refreshes in transaction log iterator Test Plan: make; db_stress --statistics=1 shows it Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13281	2013-10-04 10:47:20 -07:00
Mayank Agarwal	854d236361	Add backward compatible option in GetLiveFiles to choose whether to not Flush first Summary: As explained in comments in GetLiveFiles in db.h, this option will cause flush to be skipped in GetLiveFiles because some use-cases use GetSortedWalFiles after GetLiveFiles to generate more complete snapshots. Using GetSortedWalFiles after GetLiveFiles allows us to not Flush in GetLiveFiles first because wals have everything. Note: file deletions will be disabled before calling GLF or GSWF so live logs will not move to archive logs or get delted. Note: Manifest file is truncated to a proper value in GLF, so it will always reply from the proper wal files on a restart Test Plan: make Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13257	2013-10-04 10:20:10 -07:00
Haobo Xu	200c05a23f	[RocksDB] Still honor DisableFileDeletions when purge_log_after_memtable_flush is on Summary: as title Test Plan: make check Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13263	2013-10-03 16:12:43 -07:00
Haobo Xu	fa798e9e28	[Rocksdb] Submit mem table flush job in a different thread pool Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow . Test Plan: Try it in shadow test. Reviewers: dhruba, xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D12933	2013-10-03 14:37:19 -07:00
Xing Jin	658a3ce2fa	Fix SIGSEGV issue in universal compaction Summary: We saw SIGSEGV when set options.num_levels=1 in universal compaction style. Dug into this issue for a while, and finally found the root cause (thank Haobo for discussion). Test Plan: Add new unit test. It throws SIGSEGV without this change. Also run "make all check". Reviewers: haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13251	2013-10-02 17:33:31 -07:00
Haobo Xu	71046971f0	[RocksDB] Added perf counters to track skipped internal keys during iteration Summary: as title. unit test not polished. this is for a quick live test Test Plan: live Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13221	2013-10-02 10:48:41 -07:00
Xing Jin	8eb552bf4d	New unit test for iterator with snapshot Summary: I played with the reported bug about iterator with snapshot: https://code.google.com/p/leveldb/issues/detail?id=200. I turned the original test program (https://code.google.com/p/leveldb/issues/attachmentText?id=200&aid=2000000000&name=test.cc&token=7uOUQW-HFlbAFMUm7EqtaAEy7Tw%3A1378320724136) into a new unit test, but I cannot reproduce the problem. Notice lines 31-34 in above link. I have ran the new test with and without such Put() operations. Both succeed. So this diff simply adds the test, without changing any source codes. Test Plan: run new test. Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12735	2013-09-28 11:39:08 -07:00
Haobo Xu	0c4040681a	[RocksDB] Move last_sequence and last_flushed_sequence_ update back into lock protected area Summary: A previous diff moved these outside of lock protected area. Moved back in now. Also moved tmp_batch_ update outside of lock protected area, as only the single write thread can access it. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13137	2013-09-26 20:43:11 -07:00
Haobo Xu	08740b15a4	[RocksDB] Fix skiplist sequential insertion optimization Summary: The original optimization missed updating links other than the lowest level. Test Plan: make check; perf_context_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb, adsharma Differential Revision: https://reviews.facebook.net/D13119	2013-09-26 15:17:03 -07:00
Haobo Xu	e0aa19a94e	[RocbsDB] Add an option to enable set based memtable for perf_context_test Summary: as title. Some result: -- Sequential insertion of 1M key/value with stock skip list (all in on memtable) time ./perf_context_test --total_keys=1000000 --use_set_based_memetable=0 Inserting 1000000 key/value pairs ... Put uesr key comparison: Count: 1000000 Average: 8.0179 StdDev: 176.34 Min: 0.0000 Median: 2.5555 Max: 88933.0000 Percentiles: P50: 2.56 P75: 2.83 P99: 58.21 P99.9: 133.62 P99.99: 987.50 Get uesr key comparison: Count: 1000000 Average: 43.4465 StdDev: 379.03 Min: 2.0000 Median: 36.0195 Max: 88939.0000 Percentiles: P50: 36.02 P75: 43.66 P99: 112.98 P99.9: 824.84 P99.99: 7615.38 real 0m21.345s user 0m14.723s sys 0m5.677s -- Sequential insertion of 1M key/value with set based memtable (all in on memtable) time ./perf_context_test --total_keys=1000000 --use_set_based_memetable=1 Inserting 1000000 key/value pairs ... Put uesr key comparison: Count: 1000000 Average: 61.5022 StdDev: 6.49 Min: 0.0000 Median: 62.4295 Max: 71.0000 Percentiles: P50: 62.43 P75: 66.61 P99: 71.00 P99.9: 71.00 P99.99: 71.00 Get uesr key comparison: Count: 1000000 Average: 29.3810 StdDev: 3.20 Min: 1.0000 Median: 29.1801 Max: 34.0000 Percentiles: P50: 29.18 P75: 32.06 P99: 34.00 P99.9: 34.00 P99.99: 34.00 real 0m28.875s user 0m21.699s sys 0m5.749s Worst case comparison for a Put is 88933 (skiplist) vs 71 (set based memetable) Of course, there's other in-efficiency in set based memtable implementation, which lead to the overall worst performance. However, P99 behavior advantage is very very obvious. Test Plan: ./perf_context_test and viewstate shadow testing Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13095	2013-09-25 22:49:18 -07:00
Dhruba Borthakur	f1a60e5c3e	The vector rep implementation was segfaulting because of incorrect initialization of vector. Summary: The constructor for Vector memtable has a parameter called 'count' that specifies the capacity of the vector to be reserved at allocation time. It was incorrectly used to initialize the size of the vector. Test Plan: Enhanced db_test. Reviewers: haobo, xjin, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13083	2013-09-25 11:33:52 -07:00
Dhruba Borthakur	5e9f3a9aa7	Better locking in vectorrep that increases throughput to match speed of storage. Summary: There is a use-case where we want to insert data into rocksdb as fast as possible. Vector rep is used for this purpose. The background flush thread needs to flush the vectorrep to storage. It acquires the dblock then sorts the vector, releases the dblock and then writes the sorted vector to storage. This is suboptimal because the lock is held during the sort, which prevents new writes for occuring. This patch moves the sorting of the vector rep to outside the db mutex. Performance is now as fastas the underlying storage system. If you are doing buffered writes to rocksdb files, then you can observe throughput upwards of 200 MB/sec writes. This is an early draft and not yet ready to be reviewed. Test Plan: make check Task ID: # Blame Rev: Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12987	2013-09-19 21:48:10 -07:00
Haobo Xu	4734dbb742	[RocksDB] Unit test to show Seek key comparison number Summary: Added SeekKeyComparison to show the uer key comparison incurred by Seek. Test Plan: make perf_context_test export LEVELDB_TESTS=DBTest.SeekKeyComparison ./perf_context_test --write_buffer_size=500000 --total_keys=10000 ./perf_context_test --write_buffer_size=250000 --total_keys=10000 Reviewers: dhruba, xjin Reviewed By: xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D12843	2013-09-18 21:43:41 -07:00
Haobo Xu	72fcbf055d	[RocksDB] Fix DBTest.UniversalCompactionSizeAmplification too Summary: as title Test Plan: make db_test; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13005	2013-09-17 21:29:33 -07:00
Haobo Xu	5b76338c01	[RocksDB] Fix DBTest.UniversalCompactionTrigger to reflect the correct compaction trigger condition. Summary: as title Test Plan: make db_test; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12981	2013-09-17 14:17:48 -07:00
Rajat Goel	11c65021fb	Revert "Minor fixes found while trying to compile it using clang on Mac OS X" This reverts commit `5f2c136c32`.	2013-09-15 23:01:26 -07:00
Haobo Xu	1d8c57db23	[RocksDB] Universal compaction trigger condition minor fix Summary: Currently, when total number of files reaches level0_file_num_compaction_trigger, universal compaction will schedule a compaction job, but the job will not honor the compaction until the total number of files is level0_file_num_compaction_trigger+1. Fixed the condition for consistent behavior (start compaction on reaching level0_file_num_compaction_trigger). Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12945	2013-09-15 22:35:59 -07:00
Rajat Goel	5f2c136c32	Minor fixes found while trying to compile it using clang on Mac OS X	2013-09-15 22:06:14 -07:00
Dhruba Borthakur	4012ca1c7b	Added a parameter to limit the maximum space amplification for universal compaction. Summary: Added a new field called max_size_amplification_ratio in the CompactionOptionsUniversal structure. This determines the maximum percentage overhead of space amplification. The size amplification is defined to be the ratio between the size of the oldest file to the sum of the sizes of all other files. If the size amplification exceeds the specified value, then min_merge_width and max_merge_width are ignored and a full compaction of all files is done. A value of 10 means that the size a database that stores 100 bytes of user data could occupy 110 bytes of physical storage. Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added. Reviewers: haobo, emayanke, xjin Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12825	2013-09-13 16:27:18 -07:00
Haobo Xu	0e422308aa	[RocksDB] Remove Log file immediately after memtable flush Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11709	2013-09-12 11:54:44 -07:00
Haobo Xu	f2f4c8072f	[RocksDB] Added nano second stopwatch and new perf counters to track block read cost Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call. Test Plan: perf_context_test, also testing with viewstate shadow traffic. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D12351	2013-09-07 21:14:54 -07:00
Dhruba Borthakur	32c965d417	Flush was hanging because the configured options specified that more than 1 memtable need to be merged. Summary: There is an config option called Options.min_write_buffer_number_to_merge that specifies the minimum number of write buffers to merge in memory before flushing to a file in L0. But in the the case when the db is being closed, we should not be using this config, instead we should flush whatever write buffers were available at that time. Test Plan: Unit test attached. Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12717	2013-09-06 16:28:33 -07:00
Dhruba Borthakur	197034e4c3	An iterator may automatically invoke reseeks. Summary: An iterator invokes reseek if the number of sequential skips over the same userkey exceeds a configured number. This makes iter->Next() faster (bacause of fewer key compares) if a large number of adjacent internal keys in a table (sst or memtable) have the same userkey. Test Plan: Unit test DBTest.IterReseek. Reviewers: emayanke, haobo, xjin Reviewed By: xjin CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11865	2013-09-06 11:50:53 -07:00
Mayank Agarwal	aa5c897d42	Return pathname relative to db dir in LogFile and cleanup AppendSortedWalsOfType Summary: So that replication can just download from wherever LogFile.Pathname is pointing them. Test Plan: make all check;./db_repl_stress Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12609	2013-09-04 13:44:43 -07:00
Xing Jin	42c109cc2e	New ldb command to convert compaction style Summary: Add new command "change_compaction_style" to ldb tool. For universal->level, it shows "nothing to do". For level->universal, it compacts all files into a single one and moves the file to level 0. Also add check for number of files at level 1+ when opening db with universal compaction style. Test Plan: 'make all check'. New unit test for internal convertion function. Also manully test various cmd like: ./ldb change_compaction_style --old_compaction_style=0 --new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test Reviewers: haobo, dhruba Reviewed By: haobo CC: vamsi, emayanke Differential Revision: https://reviews.facebook.net/D12603	2013-09-04 13:13:08 -07:00
Mayank Agarwal	c34271a5a5	Fix bug in Counters and record Sequencenumber using only TickerCount Summary: The way counters/statistics are implemented in rocksdb demands that enum Tickers and TickerNameMap follow the same order, otherwise statistics exposed from fbcode/rocks get out-of-sync. 2 counters for prefix had violated this order and when I built counters for fbcode/mcrocksdb, statistics for sequence number were appearing out-of-sync. The other change is to record sequence-number using setTickerCount only and not recordTick. This is because of difference in statistics as understood by rocks/utils which uses ServiceData::statistics function and rocksdb statistics. In rocksdb there is just 1 counter for a countername. But in ServiceData there are 4 independent buckets for every countername-Count, Sum, Average and Rate. SetTickerCount and RecordTick update the same variable in rocksdb but different buckets in ServiceData. Therefore, I had to choose one consistent function from RecordTick or SetTickerCount for sequence number in rocksdb. I chose SetTickerCount because the statistics object in options passed during rocksdb-open is user-dependent and SetTickerCount makes sense there. There will be a corresponding diff to mcorcksdb in fbcode shortly. Test Plan: make all check; check ticker value using fprintfs Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12669	2013-09-01 17:59:32 -07:00
Mayank Agarwal	ab5c5c28fe	Fix build caused by DeleteFile not tolerating / at the beginning Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName. Test Plan: make;./delete_filetest;make all check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12663	2013-09-01 17:59:13 -07:00
Mayank Agarwal	46dcf51ca5	Return a '/' before names of all files through MakeFileName Summary: // won't hurt but a missing / hurts sometimes Test Plan: make all check; ./db_repl_stress Reviewers: vamsi Reviewed By: vamsi CC: dhruba Differential Revision: https://reviews.facebook.net/D12621	2013-08-30 14:25:22 -07:00
Dhruba Borthakur	59de2dbad7	Cleanup DeleteFile API Summary: The DeleteFile API was removing files inside the db-lock. This is now changed to remove files outside the db-lock. The GetLiveFilesMetadata() returns the smallest and largest seqnuence number of each file as well. Test Plan: deletefile_test Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12567	2013-08-28 21:18:58 -07:00
Haobo Xu	48e5ea0c34	[RocksDB] Fix TransformRepFactory related valgrind problem Summary: Let TransformRepFactory own the passed in transform. Also make it better encapsulated. Test Plan: make valgrind_check; Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12591	2013-08-28 19:27:54 -07:00
Dhruba Borthakur	fc0c399d2e	Introduced a new flag non_blocking_io in ReadOptions. Summary: If ReadOptions.non_blocking_io is set to true, then KeyMayExists and Iterators will return data that is cached in RAM. If the Iterator needs to do IO from storage to serve the data, then the Iterator.status() will return Status::IsRetry(). Test Plan: Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs issues from storage. Added DBTest.NonBlockingIteration to verify nonblocking Iterations. Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12531	2013-08-28 10:49:14 -07:00
Haobo Xu	43eef52001	[RocksDB] move stats counting outside of mutex protected region for DB::Get() Summary: As title. This is possible as tickers are atomic now. db_bench on high qps in-memory muti-thread random get workload, showed ~5% throughput improvement. Test Plan: make check; db_bench; db_stress Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12555	2013-08-27 13:36:10 -07:00
Mayank Agarwal	dad2731729	Fix bug in KeyMayExist Summary: In KeyMayExist.db_test we do a Flush which causes sst file to be written and added as open file in TableCache, but block cache for the file is not populated. So value_found should have been false where it was true and KeyMayExist.db_test should not have passed earlier. But it passed because BlockReader in table/table.cc takes 2 default arguments at the end called for_compaction and no_io. Although I passed no_io=true from InternalGet to BlockReader, but it understood for_compaction=true and defaulted no_io to false. This is a bug and although will be removed by Dhruba's new patch to incorporate no_io in readoptions, I'm submitting this patch to fix this bug independently of that patch. Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12537	2013-08-26 08:45:58 -07:00
Mayank Agarwal	b1074ac24f	Use initializer list for VersionSet Summary: initialiszer list is fasteri/preferable because it can straightaway call the constructor for this object, otherwise it will be created first and then again initialized. Although gain may not be much in this case because files_ is just a pointer and not a complex object, this is recommended practice. Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12519	2013-08-24 18:16:01 -07:00
Deon Nicholas	573844807c	Fix for no_io Summary: Oops. My bad. Test Plan: Make all check Reviewers: emayanke Reviewed By: emayanke CC: haobo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D12525	2013-08-23 16:36:01 -07:00
Tyler Harter	4504c99030	Internal/user key bug fix. Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set. Test Plan: ./filter_block_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12501	2013-08-23 14:49:57 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Jim Paton	74781a0c49	Add three new MemTableRep's Summary: This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep. UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that. VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector. PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets. I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed). Test Plan: make -j32 check ./db_stress --memtablerep=vector ./db_stress --memtablerep=unsorted ./db_stress --memtablerep=prefixhash --prefix_size=10 Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12117	2013-08-22 23:10:02 -07:00
Xing Jin	17dc128048	Pull from https://reviews.facebook.net/D10917 Summary: Pull Mark's patch and slightly revise it. I revised another place in db_impl.cc with similar new formula. Test Plan: make all check. Also run "time ./db_bench --num=2500000000 --numdistinct=2200000000". It has run for 20+ hours and hasn't finished. Looks good so far: Installed stack trace handler for SIGILL SIGSEGV SIGBUS SIGABRT LevelDB: version 2.0 Date: Tue Aug 20 23:11:55 2013 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 2500000000 RawSize: 276565.6 MB (estimated) FileSize: 157356.3 MB (estimated) Write rate limit: 0 Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/leveldbtest-3088/dbbench] fillseq : 7202.000 micros/op 138 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] fillsync : 7148.000 micros/op 139 ops/sec; (2500000 ops) DB path: [/tmp/leveldbtest-3088/dbbench] fillrandom : 7105.000 micros/op 140 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] overwrite : 6930.000 micros/op 144 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.020 micros/op 980507 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.021 micros/op 979620 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readseq : 113.000 micros/op 8849 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readreverse : 102.000 micros/op 9803 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] Created bg thread 0x7f0ac17f7700 compact : 111701.000 micros/op 8 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.020 micros/op 980376 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readseq : 120.000 micros/op 8333 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readreverse : 29.000 micros/op 34482 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] ... finished 618100000 ops Reviewers: MarkCallaghan, haobo, dhruba, chip Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D12441	2013-08-22 22:37:13 -07:00
Tyler Harter	94cf218720	Revert "Prefix scan: db_bench and bug fixes" This reverts commit `c2bd8f4824`.	2013-08-22 18:01:11 -07:00
Tyler Harter	c2bd8f4824	Prefix scan: db_bench and bug fixes Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff). Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10 Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12273	2013-08-22 16:06:50 -07:00
Simha Venkataramaiah	60bf2b7d4a	Add APIs to query SST file metadata and to delete specific SST files Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures. Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api. I have used std::unique_ptr - will switch to boost:: since this is external. thoughts? Unit test is fragile right now as it expects the compaction at certain levels. Test Plan: unittest Reviewers: dhruba, vamsi, emayanke CC: zshao, leveldb, haobo Task ID: # Blame Rev:	2013-08-22 15:27:19 -07:00
Jim Paton	cb703c9d03	Allow WriteBatch::Handler to abort iteration Summary: Sometimes you don't need to iterate through the whole WriteBatch. This diff makes the Handler member functions return a bool that indicates whether to abort or not. If they return true, the iteration stops. One thing I just thought of is that this will break backwards-compability. Maybe it would be better to add a virtual member function WriteBatch::Handler::ShouldAbort() that returns false by default. Comments requested. I still have to add a new unit test for the abort code, but let's finalize the API first. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12339	2013-08-21 18:27:48 -07:00
Haobo Xu	f9e2decf7c	[RocksDB] Minor iterator cleanup Summary: Was going through the iterator related code, did some cleanup along the way. Basically replaced array with vector and adopted range based loop where applicable. Test Plan: make check; make valgrind_check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12435	2013-08-21 16:54:48 -07:00
Deon Nicholas	b87dcae1a3	Made merge_oprator a shared_ptr; and added TTL unit tests Test Plan: - make all check; - make release; - make stringappend_test; ./stringappend_test Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D12381	2013-08-20 13:35:28 -07:00
Mayank Agarwal	8a3547d38e	API for getting archived log files Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12087	2013-08-19 13:37:04 -07:00
Deon Nicholas	e1346968d8	Merge operator fixes part 1. Summary: -Added null checks and revisions to DBIter::MergeValuesNewToOld() -Added DBIter test to stringappend_test -Major fix with Merge and TTL More plans for fixes later. Test Plan: -make clean; make stringappend_test -j 32; ./stringappend_test -make all check; Reviewers: haobo, emayanke, vamsi, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12315	2013-08-19 11:42:47 -07:00
Mayank Agarwal	387ac0f1e1	Expose statistic for sequence number and implement setTickerCount Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12327	2013-08-15 23:00:20 -07:00
Deon Nicholas	d1d3d15eb7	Tiny fix to db_bench for make release. Summary: In release, "found variable assigned but not used anywhere". Changed it to work with assert. Someone accept this :). Test Plan: make release -j 32 Reviewers: haobo, dhruba, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12309	2013-08-15 17:50:12 -07:00
Deon Nicholas	ad48c3c262	Benchmarking for Merge Operator Summary: Updated db_bench and utilities/merge_operators.h to allow for dynamic benchmarking of merge operators in db_bench. Added a new test (--benchmarks=mergerandom), which performs a bunch of random Merge() operations over random keys. Also added a "--merge_operator=" flag so that the tester can easily benchmark different merge operators. Currently supports the PutOperator and UInt64Add operator. Support for stringappend or list append may come later. Test Plan: 1. make db_bench 2. Test the PutOperator (simulating Put) as follows: ./db_bench --benchmarks=fillrandom,readrandom,updaterandom,readrandom,mergerandom,readrandom --merge_operator=put --threads=2 3. Test the UInt64AddOperator (simulating numeric addition) similarly: ./db_bench --value_size=8 --benchmarks=fillrandom,readrandom,updaterandom,readrandom,mergerandom,readrandom --merge_operator=uint64add --threads=2 Reviewers: haobo, dhruba, zshao, MarkCallaghan Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11535	2013-08-15 17:13:07 -07:00
Xing Jin	0a5afd1afc	Minor fix to current codes Summary: Minor fix to current codes, including: coding style, output format, comments. No major logic change. There are only 2 real changes, please see my inline comments. Test Plan: make all check Reviewers: haobo, dhruba, emayanke Differential Revision: https://reviews.facebook.net/D12297	2013-08-14 23:03:57 -07:00
Jim Paton	0307c5fe3a	Implement log blobs Summary: This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch and the ordering of blobs, puts, merges, and deletes are preserved. Blobs are not saves to SST files. RocksDB ignores blobs in every way except for writing them to the log. Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12195	2013-08-14 16:32:46 -07:00
Haobo Xu	d9dd2a1926	[RocksDB] Expose thread local perf counter for low overhead, per call level performance statistics. Summary: As title. No locking/atomic is needed due to thread local. There is also no need to modify the existing client interface, in order to expose related counters. perf_context_test shows a simple example of retrieving the number of user key comparison done for each put and get call. More counters could be added later. Sample output ./perf_context_test 1000000 ==== Test PerfContextTest.KeyComparisonCount Inserting 1000000 key/value pairs ... total user key comparison get: 43446523 total user key comparison put: 8017877 max user key comparison get: 88939 avg user key comparison get:43 Basically, the current skiplist does well on average, but could perform poorly in extreme cases. Test Plan: run perf_context_test <total number of entries to put/get> Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D12225	2013-08-14 15:24:06 -07:00
Mayank Agarwal	f1bf169484	Counter for merge failure Summary: With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail. Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me. Test Plan: make all check Reviewers: dnicholas, haobo, dhruba, vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D12129	2013-08-13 14:25:42 -07:00
Tyler Harter	f5f1842282	Prefix filters for scans (v4) Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner. Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: haobo, emayanke Differential Revision: https://reviews.facebook.net/D12027	2013-08-13 14:04:56 -07:00
sumeet	3b81df34bd	Separate compaction filter for each compaction Summary: If we have same compaction filter for each compaction, application cannot know about the different compaction processes. Later on, we can put in more details in compaction filter for the application to consume and use it according to its needs. For e.g. In the universal compaction, we have a compaction process involving all the files while others don't involve all the files. Applications may want to collect some stats only when during full compaction. Test Plan: run existing unit tests Reviewers: haobo, dhruba Reviewed By: dhruba CC: xinyaohu, leveldb Differential Revision: https://reviews.facebook.net/D12057	2013-08-13 10:56:20 -07:00
Dhruba Borthakur	03bd4461ad	Merge branch 'performance' of github.com:facebook/rocksdb into performance	2013-08-12 10:31:21 -07:00
Dhruba Borthakur	f3967a5132	Merge remote-tracking branch 'origin' into performance	2013-08-12 09:58:50 -07:00
Dhruba Borthakur	93d77a27d2	Universal Compaction should keep DeleteMarkers unless it is the earliest file. Summary: The pre-existing code was purging a DeleteMarker if thay key did not exist in deeper levels. But in the Universal Compaction Style, all files are in Level0. For compaction runs that did not include the earliest file, we were erroneously purging the DeleteMarkers. The fix is to purge DeleteMarkers only if the compaction includes the earlist file. Test Plan: DBTest.Randomized triggers this code path. Differential Revision: https://reviews.facebook.net/D12081	2013-08-09 14:03:57 -07:00
Xing Jin	8ae905ed63	Fix unit tests for universal compaction (step 2) Summary: Continue fixing existing unit tests for universal compaction. I have tried to apply universal compaction to all unit tests those haven't called ChangeOptions(). I left a few which are either apparently not applicable to universal compaction (because they check files/keys/values at level 1 or above levels), or apparently not related to compaction (e.g., open a file, open a db). I also add a new unit test for universal compaction. Good news is I didn't see any bugs during this round. Test Plan: Ran "make all check" yesterday. Has rebased and is rerunning Reviewers: haobo, dhruba Differential Revision: https://reviews.facebook.net/D12135	2013-08-09 13:35:44 -07:00
Haobo Xu	3a3b1c3e6c	[RocksDB] Improve manifest dump to print internal keys in hex for version edits. Summary: Currently, VersionEdit::DebugString always display internal keys in the original ascii format. This could cause manifest dump to be truncated if internal keys contain special charactors (like null). Also added an option --input_key_hex for ldb idump to indicate that the passed in user keys are in hex. Test Plan: run ldb manifest_dump Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12111	2013-08-08 16:19:01 -07:00
Xing Jin	17b8f786a3	Fix unit tests/bugs for universal compaction (first step) Summary: This is the first step to fix unit tests and bugs for universal compactiion. I added universal compaction option to ChangeOptions(), and fixed all unit tests calling ChangeOptions(). Some of these tests obviously assume more than 1 level and check file number/values in level 1 or above levels. I set kSkipUniversalCompaction for these tests. The major bug I found is manual compaction with universal compaction never stops. I have put a fix for it. I have also set universal compaction as the default compaction and found at least 20+ unit tests failing. I haven't looked into the details. The next step is to check all unit tests without calling ChangeOptions(). Test Plan: make all check Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D12051	2013-08-07 14:05:44 -07:00
Dhruba Borthakur	f5fa26b6a9	Merge branch 'performance' of github.com:facebook/rocksdb into performance Conflicts: db/builder.cc db/db_impl.cc db/version_set.cc include/leveldb/statistics.h	2013-08-07 11:58:06 -07:00
Deon Nicholas	68a4cdf3f7	Build fix with merge_test and ttl	2013-08-06 11:42:21 -07:00
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	2013-08-05 20:14:32 -07:00
Jim Paton	8e792e5896	Add soft_rate_limit stats Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit. Test Plan: make -j32 check Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12021	2013-08-05 18:45:23 -07:00
Mayank Agarwal	1d7b4765c3	Expose base db object from ttl wrapper Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is' Test Plan: make Reviewers: dhruba, vamsi, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11919	2013-08-05 18:44:14 -07:00
Jim Paton	1036537c94	Add soft and hard rate limit support Summary: This diff adds support for both soft and hard rate limiting. The following changes are included: 1) Options.rate_limit is renamed to Options.hard_rate_limit. 2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds. 3) Options.soft_rate_limit is added. 4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit. 5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit. 6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default. Test Plan: make -j32 check ./db_stress Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12003	2013-08-05 15:43:49 -07:00
Dhruba Borthakur	711a30cb30	Merge branch 'master' into performance Conflicts: include/leveldb/options.h include/leveldb/statistics.h util/options.cc	2013-08-02 10:22:08 -07:00
Mayank Agarwal	c42485f67c	Merge operator for ttl Summary: Implemented a TtlMergeOperator class which inherits from MergeOperator and is TTL aware. It strips out timestamp from existing_value and attaches timestamp to new_value, calling user-provided-Merge in between. Test Plan: make all check Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11775	2013-08-01 09:27:17 -07:00
Mayank Agarwal	59d0b02f8b	Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test Test Plan: make all check;db_stress for 1 hour Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11853	2013-08-01 09:07:46 -07:00
Jim Paton	9700677a2b	Slow down writes gradually rather than suddenly Summary: Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger). This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files. For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384. Test Plan: make -j32 check ./db_stress ./db_bench Reviewers: dhruba, haobo, MarkCallaghan, xjin Reviewed By: xjin CC: leveldb, xjin, zshao Differential Revision: https://reviews.facebook.net/D11859	2013-07-31 16:20:48 -07:00
Xing Jin	0f0a24e298	Make arena block size configurable Summary: Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size. I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer. Test Plan: new unit test, I am running db_test. For passing paramter from configured option to Arena, I tried tests like: TEST(DBTest, Arena_Option) { std::string dbname = test::TmpDir() + "/db_arena_option_test"; DestroyDB(dbname, Options()); DB* db = nullptr; Options opts; opts.create_if_missing = true; opts.arena_block_size = 1000000; // tested 99, 999999 Status s = DB::Open(opts, dbname, &db); db->Put(WriteOptions(), "a", "123"); } and printed some debug info. The results look good. Any suggestion for such a unit-test? Reviewers: haobo, dhruba, emayanke, jpaton Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D11799	2013-07-31 12:42:23 -07:00
Jim Paton	6db52b525a	Don't use redundant Env::NowMicros() calls Summary: After my patch for stall histograms, there are redundant calls to NowMicros() by both the stop watches and DBImpl::MakeRoomForWrites. So I removed the redundant calls such that the information is gotten from the stopwatch. Test Plan: make clean make -j32 check Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11883	2013-07-29 15:46:36 -07:00
Jim Paton	abc90b067c	Use specific DB name in merge_test Summary: Currently, merge_test uses /tmp/testdb for the test database. It should really use something more specific to merge_test. Most of the other tests use test::TmpDir() + "/<test name>db". This patch implements such behavior for merge_test; it makes merge_test use test::TmpDir() + "/merge_testdb" Test Plan: make clean make -j32 merge_test ./merge_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11877	2013-07-29 13:26:38 -07:00
Jim Paton	18afff2e63	Add stall counts to statistics Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur. Test Plan: make -j32 check ./db_stress # Not really sure what else should be done... Reviewers: dhruba, MarkCallaghan, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11841	2013-07-29 10:34:23 -07:00
Dhruba Borthakur	a91fdf1b99	The target file size for L0 files was incorrectly set to LLONG_MAX. Summary: The target file size should be valid value. Only if UniversalCompactionStyle is enabled then set max file size to be LLONG_MAX. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-24 14:28:00 -07:00
Dhruba Borthakur	d7ba5bce37	Revert `6fbe4e981a`: If disable wal is set, then batch commits are avoided Summary: Revert "If disable wal is set, then batch commits are avoided" because keeping the mutex while inserting into the skiplist means that readers and writes are all serialized on the mutex. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-24 10:01:13 -07:00
Jim Paton	52d7ecfc78	Virtualize SkipList Interface Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations. Test Plan: make clean make -j32 check ./db_stress Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11739	2013-07-23 14:42:27 -07:00
Dhruba Borthakur	6fbe4e981a	If disable wal is set, then batch commits are avoided. Summary: rocksdb uses batch commit to write to transaction log. But if disable wal is set, then writes to transaction log are anyways avoided. In this case, there is not much value-add to batch things, batching can cause unnecessary delays to Puts(). This patch avoids batching when disableWal is set. Test Plan: make check. I am running db_stress now. Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11763	2013-07-23 14:22:57 -07:00
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	2013-07-23 13:36:50 -07:00
Haobo Xu	d364eea1fc	[RocksDB] Fix FindMinimumEmptyLevelFitting Summary: as title Test Plan: make check; Reviewers: xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D11751	2013-07-22 12:31:43 -07:00
Haobo Xu	9ee68871dc	[RocksDB] Enable manual compaction to move files back to an appropriate level. Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility. Test Plan: make check; Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D11553	2013-07-19 16:20:36 -07:00
Dhruba Borthakur	4a745a5666	Merge branch 'master' into performance Conflicts: db/version_set.cc include/leveldb/options.h util/options.cc	2013-07-17 15:05:57 -07:00
Dhruba Borthakur	76a4923307	Make file-sizes and grandparentoverlap to be unsigned to avoid bad comparisions. Summary: The maxGrandParentOverlapBytes_ was signed which was causing an erroneous comparision between signed and unsigned longs. This, in turn, was causing compaction-created-output-files to be very small in size. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11727	2013-07-17 14:26:06 -07:00
Mayank Agarwal	e9b675bd94	Fix memory leak in KeyMayExist test part of db_test Summary: NewBloomFilterPolicy call requires Delete to be called later on Test Plan: make; valgrind ./db_test Reviewers: haobo, dhruba, vamsi Differential Revision: https://reviews.facebook.net/D11667	2013-07-12 16:58:57 -07:00
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	2013-07-11 12:11:11 -07:00
Xing Jin	8a5341ec7d	Newbie code question Summary: This diff is more about my question when reading compaction codes, instead of a normal diff. I don't quite understand the logic here. Test Plan: I didn't do any test. If this is a bug, I will continue doing some test. Reviewers: haobo, dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11661	2013-07-11 09:03:40 -07:00
Dhruba Borthakur	289efe9922	Update statistics only if needed. Summary: Update statistics only if needed. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-09 16:17:00 -07:00
Dhruba Borthakur	7cb8d462d5	Rename PickCompactionHybrid to PickCompactionUniversal. Summary: Rename PickCompactionHybrid to PickCompactionUniversal. Changed a few LOG message from "Hybrid:" to "Universal:". Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-09 16:08:54 -07:00
Haobo Xu	a8d5f8dde2	[RocksDB] Remove old readahead options Summary: As title. Test Plan: make check; db_bench Reviewers: dhruba, MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11643	2013-07-09 11:22:33 -07:00
Haobo Xu	9ba82786ce	[RocksDB] Provide contiguous sequence number even in case of write failure Summary: Replication logic would be simplifeid if we can guarantee that write sequence number is always contiguous, even if write failure occurs. Dhruba and I looked at the sequence number generation part of the code. It seems fixable. Note that if WAL was successful and insert into memtable was not, we would be in an unfortunate state. The approach in this diff is : IO error is expected and error status will be returned to client, sequence number will not be advanced; In-mem error is not expected and we panic. Test Plan: make check; db_stress Reviewers: dhruba, sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D11439	2013-07-08 15:31:09 -07:00
Dhruba Borthakur	116ec527f2	Renamed 'hybrid_compaction' tp be "Universal Compaction'. Summary: All the universal compaction parameters are encapsulated in a new file universal_compaction.h Test Plan: make check	2013-07-03 15:47:53 -07:00
Dhruba Borthakur	47c4191fe8	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	2013-06-30 20:07:04 -07:00
Haobo Xu	71e0f695c1	[RocksDB] Expose count for WriteBatch Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check. Test Plan: make check; Reviewers: emayanke, sheki, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11523	2013-06-26 15:13:21 -07:00
Abhishek Kona	5ef6bb8c37	[rocksdb][refactor] statistic printing code to one place Summary: $title Test Plan: db_bench --statistics=1 Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11373	2013-06-18 20:28:41 -07:00
Haobo Xu	3cc1af2062	[RocksDB] Option for incremental sync Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11295	2013-06-18 15:00:32 -07:00
Abhishek Kona	79f4fd2b62	[Rocksdb] Simplify Printing code in db_bench Summary: simplify the printing code in db_bench use TickersMap and HistogramsNameMap introduced in previous diffs. Test Plan: ./db_bench --statistics=1 and see if all the statistics are printed Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11355	2013-06-18 14:58:00 -07:00
Dhruba Borthakur	6acbe0fc45	Compact multiple memtables before flushing to storage. Summary: Merge multiple multiple memtables in memory before writing it out to a file in L0. There is a new config parameter min_write_buffer_number_to_merge that specifies the number of write buffers that should be merged together to a single file in storage. The system will not flush wrte buffers to storage unless at least these many buffers have accumulated in memory. The default value of this new parameter is 1, which means that a write buffer will be immediately flushed to disk as soon it is ready. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11241	2013-06-18 14:28:04 -07:00
Abhishek Kona	00124683de	[rocksdb] do not trim range for level0 in manual compaction Summary: https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178 Ported the solution as is to RocksDB. Test Plan: moved the unit test as manual_compaction_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11331	2013-06-17 13:58:17 -07:00
Abhishek Kona	bff718d81c	[Rocksdb] Implement filluniquerandom Summary: Use a bit set to keep track of which random number is generated. Currently only supports single-threaded. All our perf tests are run with threads=1 Copied over bitset implementation from common/datastructures Test Plan: printed the generated keys, and verified all keys were present. Reviewers: MarkCallaghan, haobo, dhruba Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11247	2013-06-14 16:17:56 -07:00
Deon Nicholas	2a52e1dcb6	Fix db_bench for release build. Test Plan: make release Reviewers: haobo, dhruba, jpaton Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11307	2013-06-14 16:00:47 -07:00
Haobo Xu	1afdf28701	[RocksDB] Compaction Filter Cleanup Summary: This hopefully gives the right semantics to compaction filter. Will write a small wiki to explain the ideas. Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11121	2013-06-14 14:23:08 -07:00
Abhishek Kona	7a5f71d19a	[Rocksdb] measure table open io in a histogram Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times. Test Plan: db_bench --statistics=1 Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb, MarkCallaghan Differential Revision: https://reviews.facebook.net/D11217	2013-06-13 17:25:09 -07:00
Deon Nicholas	4985a9f73b	[Rocksdb] [Multiget] Introduced multiget into db_bench Summary: Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n flags for db_bench. Also updated and tested the ReadRandom() method to include an option to use multiget. By default, keys_per_multiget=100. Preliminary tests imply that multiget is at least 1.25x faster per key than regular get. Will continue adding Multiget for ReadMissing, ReadHot, RandomWithVerify, ReadRandomWriteRandom; soon. Will also think about ways to better verify benchmarks. Test Plan: 1. make db_bench 2. ./db_bench --benchmarks=fillrandom 3. ./db_bench --benchmarks=readrandom --use_existing_db=1 --use_multiget=1 --threads=4 --keys_per_multiget=100 4. ./db_bench --benchmarks=readrandom --use_existing_db=1 --threads=4 5. Verify ops/sec (and 1000000 of 1000000 keys found) Reviewers: haobo, MarkCallaghan, dhruba Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11127	2013-06-12 12:42:21 -07:00
Haobo Xu	bdf1085944	[RocksDB] cleanup EnvOptions Summary: This diff simplifies EnvOptions by treating it as POD, similar to Options. - virtual functions are removed and member fields are accessed directly. - StorageOptions is removed. - Options.allow_readahead and Options.allow_readahead_compactions are deprecated. - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11175	2013-06-12 11:17:19 -07:00
Dhruba Borthakur	e673d5d26d	Do not submit multiple simultaneous seek-compaction requests. Summary: The code was such that if multi-threaded-compactions as well as seek compaction are enabled then it submits multiple compaction request for the same range of keys. This causes extraneous sst-files to accumulate at various levels. Test Plan: I am not able to write a very good unit test for this one but can easily reproduce this bug with 'dbstress' with the following options. batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0; t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch Reviewers: leveldb, emayanke Reviewed By: emayanke Differential Revision: https://reviews.facebook.net/D11055	2013-06-10 15:49:19 -07:00
Dhruba Borthakur	1b69f1e584	Fix refering freed memory in earlier commit. Summary: Fix refering freed memory in earlier commit by https://reviews.facebook.net/D11181 Test Plan: make check Reviewers: haobo, sheki Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11193	2013-06-10 15:08:13 -07:00
Dhruba Borthakur	c5de1b9391	Print name of user comparator in LOG. Summary: The current code prints the name of the InternalKeyComparator in the log file. We would also like to print the name of the user-specified comparator for easier debugging. Test Plan: make check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D11181	2013-06-10 12:11:55 -07:00
Mayank Agarwal	184343a061	Max_mem_compaction_level can have maximum value of num_levels-1 Summary: Without this files could be written out to a level greater than the maximum level possible and is the source of the segfaults that wormhole awas getting. The sequence of steps that was followed: 1. WriteLevel0Table was called when memtable was to be flushed for a file. 2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed. 3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0. The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct. Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults and found that the same segfault occurs without this change and not with this. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11157	2013-06-09 10:38:55 -07:00
Abhishek Kona	e982b5a489	[Rocksdb] measure table open io in a histogram Summary: as title Test Plan: db_bench --statistics=1 check for statistic. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11109	2013-06-07 10:02:28 -07:00
Deon Nicholas	db1f0cddf3	Fixed valgrind errors	2013-06-05 17:25:16 -07:00
Deon Nicholas	d8c7c45ea0	Very basic Multiget and simple test cases. Summary: Implemented the MultiGet operator which takes in a list of keys and returns their associated values. Currently uses std::vector as its container data structure. Otherwise, it works identically to "Get". Test Plan: 1. make db_test ; compile it 2. ./db_test ; test it 3. make all check ; regress / run all tests 4. make release ; (optional) compile with release settings Reviewers: haobo, MarkCallaghan, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10875	2013-06-05 11:22:38 -07:00
Abhishek Kona	d91b42ee27	[Rocksdb] Measure all FSYNC/SYNC times Summary: Add stop watches around all sync calls. Test Plan: db_bench check if respective histograms are printed Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11073	2013-06-05 11:06:21 -07:00
Abhishek Kona	ee522d0032	[Rocksdb] Log on disable/enable file deletions Summary: as title Test Plan: compile Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11085	2013-06-05 10:48:24 -07:00
Mark Callaghan	d9f538e1a9	Improve output for GetProperty('leveldb.stats') Summary: Display separate values for read, write & total compaction IO. Display compaction amplification and write amplification. Add similar values for the period since the last call to GetProperty. Results since the server started are reported as "cumulative" stats. Results since the last call to GetProperty are reported as "interval" stats. Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count Ln-stall ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 7 13 21 0 211 0 0 211 0.0 0.0 10.1 0 0 0 0 113 0.0 1 79 157 88 993 989 198 795 194 9.0 11.3 11.2 106 405 502 97 14 0.0 2 19 36 5 63 63 37 27 36 2.4 12.3 12.2 19 14 32 18 12 0.0 >>>>>>>>>>>>>>>>>>>>>>>>> text below has been is new and/or reformatted Uptime(secs): 122.2 total, 0.9 interval Compaction IO cumulative (GB): 0.21 new, 1.03 read, 1.23 write, 2.26 read+write Compaction IO cumulative (MB/sec): 1.7 new, 8.6 read, 10.3 write, 19.0 read+write Amplification cumulative: 6.0 write, 11.0 compaction Compaction IO interval (MB): 5.59 new, 0.00 read, 5.59 write, 5.59 read+write Compaction IO interval (MB/sec): 6.5 new, 0.0 read, 6.5 write, 6.5 read+write Amplification interval: 1.0 write, 1.0 compaction >>>>>>>>>>>>>>>>>>>>>>>> text above is new and/or reformatted Stalls(secs): 90.574 level0_slowdown, 0.000 level0_numfiles, 10.165 memtable_compaction, 0.000 leveln_slowdown Task ID: # Blame Rev: Test Plan: make check, run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11049	2013-06-03 20:24:49 -07:00
Haobo Xu	2b1fb5b01d	[RocksDB] Add score column to leveldb.stats Summary: Added the 'score' column to the compaction stats output, which shows the level total size devided by level target size. Could be useful when monitoring compaction decisions... Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb, MarkCallaghan Differential Revision: https://reviews.facebook.net/D11025	2013-06-03 17:38:27 -07:00
Haobo Xu	d897d33bf1	[RocksDB] Introduce Fast Mutex option Summary: This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now. Quickly tested 8 thread cpu bound 100 byte random read. No fast mutex: ~750k/s ops With fast mutex: ~880k/s ops Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D11031	2013-06-01 23:11:34 -07:00
Haobo Xu	ab8d2f6ab2	[RocksDB] [Performance] Allow different posix advice to be applied to the same table file Summary: Current posix advice implementation ties up the access pattern hint with the creation of a file. It is not possible to apply different advice for different access (random get vs compaction read), without keeping two open files for the same table. This patch extended the RandomeAccessFile interface to accept new access hint at anytime. Particularly, we are able to set different access hint on the same table file based on when/how the file is used. Two options are added to set the access hint, after the file is first opened and after the file is being compacted. Test Plan: make check; db_stress; db_bench Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D10905	2013-05-30 19:08:44 -07:00
Haobo Xu	2df65c118c	[RocksDB] Dump counters and histogram data periodically with compaction stats Summary: As title Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10995	2013-05-29 12:00:18 -07:00
Dhruba Borthakur	a8d807ee91	Record the number of open db iterators. Summary: Enhance the statitics to report the number of open db iterators. Test Plan: make check Reviewers: haobo, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10983	2013-05-29 08:47:08 -07:00
Haobo Xu	fb684da082	[RocksDB] Fix CorruptionTest Summary: Overriding block_size_deviation to zero, so that CorruptionTest can pass. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10977	2013-05-28 12:36:42 -07:00
heyongqiang	4b29651206	add block deviation option to terminate a block before it exceeds block_size Summary: a new option block_size_deviation is added. Test Plan: run db_test and db_bench Reviewers: dhruba, haobo Reviewed By: haobo Differential Revision: https://reviews.facebook.net/D10821	2013-05-24 15:52:49 -07:00
Haobo Xu	ef15b9d178	[RocksDB] Fix MaybeDumpStats Summary: MaybeDumpStats was causing lock problem Test Plan: make check; db_stress Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10935	2013-05-24 15:43:16 -07:00
Haobo Xu	0e879c93de	[RocksDB] dump leveldb.stats periodically in LOG file. Summary: Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis. By defauly, it's set to a very big number 3600 (1 hour). Test Plan: make check; Reviewers: dhruba Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10761	2013-05-23 16:56:59 -07:00
Dhruba Borthakur	26541860d3	The max size of the write buffer size can be 64 GB. Summary: There was an artifical limit on the size of the write buffer size. Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10911	2013-05-23 15:00:27 -07:00
Dhruba Borthakur	898f79372f	Fix valgrind errors introduced by https://reviews.facebook.net/D10863 Summary: The valgrind errors were in the unit tests where we change the number of levels of a database using internal methods. Test Plan: valgrind ./reduce_levels_test valgrind ./db_test Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10893	2013-05-23 12:14:41 -07:00
Haobo Xu	c2e2460f8a	[RocksDB] Expose DBStatistics Summary: Make Statistics usable by client Test Plan: make check; db_bench Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10899	2013-05-23 11:49:38 -07:00
Haobo Xu	87d0af15d8	[RocksDB] Introduce an option to skip log error on recovery Summary: Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery. If client is ok with losing most recent updates, we could simply skip those errors. However, it's important to introduce an additional flag, so that paranoid_check can still guard against more serious problems. Test Plan: make check; db_stress Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb, emayanke Differential Revision: https://reviews.facebook.net/D10869	2013-05-21 14:30:36 -07:00
Dhruba Borthakur	d1aaaf718c	Ability to set different size fanout multipliers for every level. Summary: There is an existing field Options.max_bytes_for_level_multiplier that sets the multiplier for the size of each level in the database. This patch introduces the ability to set different multipliers for every level in the database. The size of a level is determined by using both max_bytes_for_level_multiplier as well as the per-level fanout. size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier * fanout[i-1] The default value of fanout is 1, so that it is backward compatible. Test Plan: make check Reviewers: haobo, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10863	2013-05-21 13:50:20 -07:00
Haobo Xu	c3c13db346	[RocksDB] [Performance Bug] MemTable::Get Slow Summary: The merge operator diff introduced a performance problem in MemTable::Get. An exit condition is missed when the current key does not match the user key. This could lead to full memtable scan if the user key is not found. Test Plan: make check; db_bench Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10851	2013-05-21 13:40:38 -07:00
Abhishek Kona	e1174306c5	[RocksDB] Simplify StopWatch implementation Summary: Make stop watch a simple implementation, instead of subclass of a virtual class Allocate stop watches off the stack instead of heap. Code is more terse now. Test Plan: make all check, db_bench with --statistics=1 Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10809	2013-05-17 10:55:34 -07:00
Abhishek Kona	446151cd20	[Rocksdb] Remove unused double apis to record into histograms Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere Test Plan: make all check Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10815	2013-05-16 10:40:30 -07:00
Haobo Xu	4ca3c67bd3	[RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer. Summary: This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface. The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10773	2013-05-13 14:06:10 -07:00
Haobo Xu	73c0a33346	[RocksDB] fix compaction filter trigger condition Summary: Currently, compaction filter is run on internal key older than the oldest snapshot, which is incorrect. Compaction filter should really be run on the most recent internal key when there is no external snapshot. Test Plan: make check; db_stress Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10641	2013-05-13 12:33:02 -07:00
Abhishek Kona	8d58ecdc29	[RocksDB] Expose compaction stalls via db_statistics Test Plan: make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10575	2013-05-10 14:41:45 -07:00
Dhruba Borthakur	a8d3aa2c26	Assertion failure for L0-L1 compactions. Summary: For level-0 compactions, we try to find if can include more L0 files in the same compaction run. This causes the 'smallest' and 'largest' key to get extended to a larger range. But the suceeding call to ParentRangeInCompaction() was still using the earlier values of 'smallest' and 'largest', Because of this bug, a file in L1 can be part of two concurrent compactions: one L0-L1 compaction and the other L1-L2 compaction. This should not cause any data loss, but will cause an assertion failure with debug builds. Test Plan: make check Differential Revision: https://reviews.facebook.net/D10677	2013-05-08 17:10:11 -07:00
Abhishek Kona	988c20b9f7	[RocksDB] Clear Archive WAL files Summary: WAL files are moved to archive directory and clear only at DB::Open. Can lead to a lot of space consumption in a Database. Added logic to periodically clear Archive Directory too. Test Plan: make all check + add unit test Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D10617	2013-05-06 11:41:01 -07:00
Haobo Xu	05e8854085	[Rocksdb] Support Merge operation in rocksdb Summary: This diff introduces a new Merge operation into rocksdb. The purpose of this review is mostly getting feedback from the team (everyone please) on the design. Please focus on the four files under include/leveldb/, as they spell the client visible interface change. include/leveldb/db.h include/leveldb/merge_operator.h include/leveldb/options.h include/leveldb/write_batch.h Please go over local/my_test.cc carefully, as it is a concerete use case. Please also review the impelmentation files to see if the straw man implementation makes sense. Note that, the diff does pass all make check and truly supports forward iterator over db and a version of Get that's based on iterator. Future work: - Integration with compaction - A raw Get implementation I am working on a wiki that explains the design and implementation choices, but coding comes just naturally and I think it might be a good idea to share the code earlier. The code is heavily commented. Test Plan: run all local tests Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9651	2013-05-03 16:59:02 -07:00
Abhishek Kona	41cb922b34	Allocate the LogReporter from heap. Summary: Summary: The current code has a bug that take address of stack allocated LogReporter. It is causing SIGSEGV because the stack address is no longer valid when referenced. Test Plan: Tested on prod. Reviewers: haobo, dhruba, heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D10557	2013-04-29 13:19:24 -07:00
Abhishek Kona	fb96ec1686	[RocksDB] Print all internally collected histograms in db_bench. Also print p95 Summary: $title Test Plan: make db_bench . run db_bench and check for expected output Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10521	2013-04-25 13:36:47 -07:00
Haobo Xu	eb6d139666	[RocksDB] Move table.h to table/ Summary: - don't see a point exposing table.h to the public. - fixed make clean to remove also *.d files. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10479	2013-04-22 16:07:56 -07:00
Abhishek Kona	344e832f55	[RocksDB] Fix ReadMissing in db_bench Summary: D8943 Broke read_missing. Fix it by adding a "." at the end of the generated key Test Plan: generate, print and check the key has a "." Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10455	2013-04-22 15:44:19 -07:00
Haobo Xu	b4243e5a3d	[RocksDB] CompactionFilter cleanup Summary: - removed the compaction_filter_value from the callback interface. Restrict compaction filter to purging values. - modify some comments to reflect curent status. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10335	2013-04-20 10:26:51 -07:00
Mark Callaghan	b1ff9ac9c5	Add --writes_per_second rate limit, print p99.99 in histogram Summary: Adds the --writes_per_second rate limit for the readwhilewriting test. The purpose is to optionally avoid saturating storage with writes & compaction and test read response time when some writes are being done. Changes the histogram code to also print the p99.99 value Task ID: # Blame Rev: Test Plan: make check, ran db_bench with it Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10305	2013-04-20 10:26:51 -07:00
Haobo Xu	1255dcd446	[RocksDB] Add stacktrace signal handler Summary: This diff provides the ability to print out a stacktrace when the process receives certain signals. Currently, we enable this for the following signals (program error related): SIGILL SIGSEGV SIGBUS SIGABRT Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode). Sample output: Received signal 11 (Segmentation fault) #0 0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4 #1 0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24 #2 0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0 #3 0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113 Segmentation fault (core dumped) For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now. Test Plan: signal_test.cc Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10173	2013-04-20 10:26:50 -07:00
Abhishek Kona	7c6c3c0ff4	[Rockdsdb] Better Error messages. Closing db instead of deleting db Summary: A better error message. A local change. Did not look at other places where this could be done. Test Plan: compile Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10251	2013-04-15 15:27:15 -07:00
Dhruba Borthakur	9b81d3c406	Simplified level_ptrs by using a std:vector Summary: Simplified level_ptrs by using a std:vector Test Plan: make check Reviewers: sheki, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10245	2013-04-15 13:52:51 -07:00
Haobo Xu	013e9ebbf1	[RocksDB] [Performance] Speed up FindObsoleteFiles Summary: FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior. Didn't profile anything, but several things could be improved: 1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree). You also don't know how many dynamic allocations occur just for building up this tree. switched to std::vector, also added logic to pre-calculate total size and do just one allocation 2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where mutex could be unlocked. 3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles I have a feeling this should pretty much fix it. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang, MarkCallaghan Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10197	2013-04-12 11:29:27 -07:00
Mayank Agarwal	94d86b25a9	Fix memory leak for probableWALfiles in db_impl.cc Summary: using unique_ptr to have automatic delete for probableWALfiles in db_impl.cc Test Plan: make Reviewers: sheki, dhruba Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D10083	2013-04-11 16:57:15 -07:00
Dhruba Borthakur	7730587120	Prevent segfault in OpenCompactionOutputFile Summary: The segfault was happening because the program was unable to open a new sst file (as part of the compaction) because the process ran out of file descriptors. The fix is to check the return status of the file creation before taking any other action. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fabf03f9700 (LWP 29904)] leveldb::DBImpl::OpenCompactionOutputFile (this=this@entry=0x7fabf9011400, compact=compact@entry=0x7fabf741a2b0) at db/db_impl.cc:1399 1399 db/db_impl.cc: No such file or directory. (gdb) where Test Plan: make check Reviewers: MarkCallaghan, sheki Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D10101	2013-04-10 09:59:48 -07:00
Abhishek Kona	574b76f710	[RocksDB][Bug] Look at all the files, not just the first file in TransactionLogIter as BatchWrites can leave it in Limbo Summary: Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the currentFile. The solution is if the last seq no. of the current file is < RequestedSeqNo. Assume the first seqNo. of the next file has to satisfy the request. Also major refactoring around the code. Moved opening the logreader to a seperate function, got rid of goto. Test Plan: added a unit test for it. Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb, emayanke Differential Revision: https://reviews.facebook.net/D10029	2013-04-08 16:28:09 -07:00
Abhishek Kona	0e40185a7d	[Rocksdb] Remove useless struct TableAndFile Summary: TableAndFile was a struct used earlier to delete the file as we did not have std::unique_ptr in the codebase. With Chip introducing C++11 hotness like std::unique_ptr we can do away with the struct. Test Plan: make all check Reviewers: haobo, heyongqiang Reviewed By: heyongqiang CC: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D9975	2013-04-05 11:26:46 -07:00
Abhishek Kona	ca789a10cc	[Rocksdb] Recover last updated sequence number from manifest also. Summary: During recovery, last_updated_manifest number was not set if there were no records in the Write-ahead log. Now check for the recovered manifest also and set last_updated_manifest file to the max value. Test Plan: unit test Reviewers: heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9891	2013-04-02 17:18:27 -07:00
Haobo Xu	6763110867	[RocksDB] Replace iterator based loop with range based loop for stl containers Summary: As title. Code is shorter and cleaner See https://our.dev.facebook.com/intern/tasks/?t=2233981 Test Plan: make check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D9789	2013-04-02 11:46:45 -07:00
Haobo Xu	645ff8f231	Let's get rid of delete as much as possible, here are some examples. Summary: If a class owns an object: - If the object can be null => use a unique_ptr. no delete - If the object can not be null => don't even need new, let alone delete - for runtime sized array => use vector, no delete. Test Plan: make check Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9783	2013-03-28 17:31:44 -07:00
Abhishek Kona	3b51605b8d	[RocksDB] Fix binary search while finding probable wal files Summary: RocksDB does a binary search to look at the files which might contain the requested sequence number at the call GetUpdatesSince. There was a bug in the binary search => when the file pointed by the middle index of bsearch was empty/corrupt it needst to resize the vector and update indexes. This now fixes that. Test Plan: existing unit tests pass. Reviewers: heyongqiang, dhruba Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9777	2013-03-28 13:37:15 -07:00
Abhishek Kona	8e9c781ae5	[Rocksdb] Fix Crash on finding a db with no log files. Error out instead Summary: If the vector returned by GetUpdatesSince is empty, it is still returned to the user. This causes it throw an std::range error. The probable file list is checked and it returns an IOError status instead of OK now. Test Plan: added a unit test. Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9771	2013-03-28 13:19:07 -07:00
Abhishek Kona	7fdd5f5b33	Use non-mmapd files for Write-Ahead Files Summary: Use non mmapd files for Write-Ahead log. Earlier use of MMaped files. made the log iterator read ahead and miss records. Now the reader and writer will point to the same physical location. There is no perf regression : ./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2 with This diff : fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s without this dif : fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s Test Plan: unit test included Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9741	2013-03-28 13:13:35 -07:00
Abhishek Kona	63f216ee0a	memory manage statistics Summary: Earlier Statistics object was a raw pointer. This meant the user had to clear up the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted. Now Using a shared_ptr to manage this. Want this in before the next release. Test Plan: make all check. Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9735	2013-03-27 11:27:39 -07:00
Haobo Xu	ecd8db0200	[RocksDB] Minimize Mutex protected code section in the critical path Summary: rocksdb uses a single global lock to protect in memory metadata. We should minimize the mutex protected code section to increase the effective parallelism of the program. See https://our.intern.facebook.com/intern/tasks/?t=2218928 Test Plan: make check db_bench Reviewers: dhruba, heyongqiang CC: zshao, leveldb Differential Revision: https://reviews.facebook.net/D9705	2013-03-26 22:42:26 -07:00
Abhishek Kona	9b70529c86	Disable Unit Test for TransactionLogIteratorStall Summary: The unit test fails as our solution does not work with MMap'd files. Disable the failing unit test. Put it back with the next diff which should fix the problem. Test Plan: db_test Reviewers: heyongqiang CC: dhruba Differential Revision: https://reviews.facebook.net/D9645	2013-03-21 15:51:18 -07:00
Abhishek Kona	27c15fb67e	TransactionLogIter should stall at the last record. Currently it errors out Summary: * Add a method to check if the log reader is at EOF. * If we know a record has been flushed force the log_reader to believe it is not at EOF, using a new method UnMarkEof(). This does not work with MMpaed files. Test Plan: added a unit test. Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9567	2013-03-21 15:12:35 -07:00
Dhruba Borthakur	d0798f67f4	Run compactions even if workload is readonly or read-mostly. Summary: The events that trigger compaction: * opening the database * Get -> only if seek compaction is not disabled and other checks are true * MakeRoomForWrite -> when memtable is full * BackgroundCall -> If the background thread is about to do a compaction run, it schedules a new background task to trigger a possible compaction. This will cause additional background threads to find and process other compactions that can run concurrently. Test Plan: ran db_bench with overwrite and readonly alternatively. Reviewers: sheki, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9579	2013-03-20 23:43:29 -07:00
Dhruba Borthakur	ad96563b79	Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. Summary: This patch allows an application to specify whether to use bufferedio, reads-via-mmaps and writes-via-mmaps per database. Earlier, there was a global static variable that was used to configure this functionality. The default setting remains the same (and is backward compatible): 1. use bufferedio 2. do not use mmaps for reads 3. use mmap for writes 4. use readaheads for reads needed for compaction I also added a parameter to db_bench to be able to explicitly specify whether to do readaheads for compactions or not. Test Plan: make check Reviewers: sheki, heyongqiang, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9429	2013-03-20 23:14:03 -07:00
Mayank Agarwal	b1bea58457	Fix more signed-unsigned comparisons Summary: Some comparisons left in log_test.cc and db_test.cc complained by make Test Plan: make Reviewers: dhruba, sheki Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9537	2013-03-19 17:21:36 -07:00
Mayank Agarwal	487168cdcf	Fixed sign-comparison in rocksdb code-base and fixed Makefile Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base Test Plan: make Reviewers: chip, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D9531	2013-03-19 14:35:23 -07:00
Mark Callaghan	72d14eafd3	add --benchmarks=levelstats option to db_bench, prevent "nan" in stats output Summary: Add --benchmarks=levelstats option to report per-level stats (#files, #bytes) Change readwhilewriting test to report response time for writes but exclude them from the stats merged by all threads. Prevent "NaN" in stats output by preventing division by 0. Remove "o" file I committed by mistake. Task ID: # Blame Rev: Test Plan: make check Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9513	2013-03-19 13:14:44 -07:00
Abhishek Kona	02c459805b	Ignore a zero-sized file while looking for a seq-no in GetUpdatesSince Summary: Rocksdb can create 0 sized log files when it is opened and closed without any operations. The GetUpdatesSince fails currently if there is a log file of size zero. This diff fixes this. If there is a log file is 0, it is removed form the probable_file_list Test Plan: unit test Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9507	2013-03-19 11:00:09 -07:00
Abhishek Kona	7b9db9c98e	DO not report level size as zero when there are no files in L0 Summary: Instead of checking for number of files in L0. Check for number of files in the requested level. Bug introduced in D4929 (diff trying to do too many things). Test Plan: db_test. Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9483	2013-03-18 12:04:38 -07:00
Mark Callaghan	5a8c8845a9	Enhance db_bench Summary: Add --benchmarks=updaterandom for read-modify-write workloads. This is different from --benchmarks=readrandomwriterandom in a few ways. First, an "operation" is the combined time to do the read & write rather than treating them as two ops. Second, the same key is used for the read & write. Change RandomGenerator to support rows larger than 1M. That was using "assert" to fail and assert is compiled-away when -DNDEBUG is used. Add more options to db_bench --duration - sets the number of seconds for tests to run. When not set the operation count continues to be the limit. This is used by random operation tests. --use_snapshot - when set GetSnapshot() is called prior to each random read. This is to measure the overhead from using snapshots. --get_approx - when set GetApproximateSizes() is called prior to each random read. This is to measure the overhead for a query optimizer. Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9267	2013-03-14 16:00:23 -07:00
Mayank Agarwal	5b278b53ae	Fix valgrind errors in rocksdb tests: auto_roll_logger_test, reduce_levels_test Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests. Test Plan: make <test>; valgrind --leak-check=full ./<test> Reviewers: sheki, dhruba Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9333	2013-03-12 16:03:16 -07:00
Dhruba Borthakur	ebf16f57c9	Prevent segfault because SizeUnderCompaction was called without any locks. Summary: SizeBeingCompacted was called without any lock protection. This causes crashes, especially when running db_bench with value_size=128K. The fix is to compute SizeUnderCompaction while holding the mutex and passing in these values into the call to Finalize. (gdb) where #4 leveldb::VersionSet::SizeBeingCompacted (this=this@entry=0x7f0b490931c0, level=level@entry=4) at db/version_set.cc:1827 #5 0x000000000043a3c8 in leveldb::VersionSet::Finalize (this=this@entry=0x7f0b490931c0, v=v@entry=0x7f0b3b86b480) at db/version_set.cc:1420 #6 0x00000000004418d1 in leveldb::VersionSet::LogAndApply (this=0x7f0b490931c0, edit=0x7f0b3dc8c200, mu=0x7f0b490835b0, new_descriptor_log=<optimized out>) at db/version_set.cc:1016 #7 0x00000000004222b2 in leveldb::DBImpl::InstallCompactionResults (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1473 #8 0x0000000000426027 in leveldb::DBImpl::DoCompactionWork (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1757 #9 0x0000000000426690 in leveldb::DBImpl::BackgroundCompaction (this=this@entry=0x7f0b49083400, madeProgress=madeProgress@entry=0x7f0b41bf2d1e, deletion_state=...) at db/db_impl.cc:1268 #10 0x0000000000428f42 in leveldb::DBImpl::BackgroundCall (this=0x7f0b49083400) at db/db_impl.cc:1170 #11 0x000000000045348e in BGThread (this=0x7f0b49023100) at util/env_posix.cc:941 #12 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x7f0b49023100) at util/env_posix.cc:874 #13 0x00007f0b4a7cf10d in start_thread (arg=0x7f0b41bf3700) at pthread_create.c:301 #14 0x00007f0b49b4b11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Test Plan: make check I am running db_bench with a value size of 128K to see if the segfault is fixed. Reviewers: MarkCallaghan, sheki, emayanke Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9279	2013-03-11 14:09:01 -07:00
Dhruba Borthakur	6d812b6afb	A mechanism to detect manifest file write errors and put db in readonly mode. Summary: If there is an error while writing an edit to the manifest file, the manifest file is closed and reopened to check if the edit made it in. However, if the re-opening of the manifest is unsuccessful and options.paranoid_checks is set t true, then the db refuses to accept new puts, effectively putting the db in readonly mode. In a future diff, I would like to make the default value of paranoid_check to true. Test Plan: make check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9201	2013-03-07 09:45:49 -08:00
Abhishek Kona	d68880a1b9	Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file Summary: Store the last flushed, seq no. in db_impl. Check against it in transaction Log iterator. Do not attempt to read ahead if we do not know if the data is flushed completely. Does not work if flush is disabled. Any ideas on fixing that? * Minor change, iter->Next is called the first time automatically for * the first time. Test Plan: existing test pass. More ideas on testing this? Planning to run some stress test. Reviewers: dhruba, heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9087	2013-03-06 14:05:53 -08:00
Dhruba Borthakur	afed60938f	Fox db_stress crash by copying keys before changing sequencenum to zero. Summary: The compaction process zeros out sequence numbers if the output is part of the bottommost level. The Slice is supposed to refer to an immutable data buffer. The merger that implements the priority queue while reading kvs as the input of a compaction run reies on this fact. The bug was that were updating the sequence number of a record in-place and that was causing suceeding invocations of the merger to return kvs in arbitrary order of sequence numbers. The fix is to copy the key to a local memory buffer before setting its seqno to 0. Test Plan: Set Options.purge_redundant_kvs_while_flush = false and then run db_stress --ops_per_thread=1000 --max_key=320 Reviewers: emayanke, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9147	2013-03-06 10:52:08 -08:00
Dhruba Borthakur	f5896681b4	Removed unnecesary file object in table_cache. Summary: TableCache->file is not used. remove it. I kept the TableAndFile structure and will clean it up in a future patch. Test Plan: make clean check Reviewers: sheki, chip Reviewed By: chip CC: leveldb Differential Revision: https://reviews.facebook.net/D9075	2013-03-04 13:56:23 -08:00
Mark Callaghan	993543d1be	Add rate_delay_limit_milliseconds Summary: This adds the rate_delay_limit_milliseconds option to make the delay configurable in MakeRoomForWrite when the max compaction score is too high. This delay is called the Ln slowdown. This change also counts the Ln slowdown per level to make it possible to see where the stalls occur. From IO-bound performance testing, the Level N stalls occur: * with compression -> at the largest uncompressed level. This makes sense because compaction for compressed levels is much slower. When Lx is uncompressed and Lx+1 is compressed then files pile up at Lx because the (Lx,Lx+1)->Lx+1 compaction process is the first to be slowed by compression. * without compression -> at level 1 Task ID: #1832108 Blame Rev: Test Plan: run with real data, added test Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9045	2013-03-04 07:41:15 -08:00
Dhruba Borthakur	806e264350	Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0. Summary: Rocks accumulates recent writes and deletes in the in-memory memtable. When the memtable is full, it writes the contents on the memtable to a file in L0. This patch removes redundant records at the time of the flush. If there are multiple versions of the same key in the memtable, then only the most recent one is dumped into the output file. The purging of redundant records occur only if the most recent snapshot is earlier than the earliest record in the memtable. Should we switch on this feature by default or should we keep this feature turned off in the default settings? Test Plan: Added test case to db_test.cc Reviewers: sheki, vamsi, emayanke, heyongqiang Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D8991	2013-03-04 00:01:47 -08:00
bil	4992633751	enable the ability to set key size in db_bench in rocksdb Summary: 1. the default value for key size is still 16 2. enable the ability to set the key size via command line --key_size= Test Plan: build & run db_banch and pass some value via command line. verify it works correctly. Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D8943	2013-03-01 14:10:09 -08:00
Abhishek Kona	c41f1e995c	Codemod NULL to nullptr Summary: scripted NULL to nullptr in * include/leveldb/ * db/ * table/ * util/ Test Plan: make all check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9003	2013-02-28 18:04:58 -08:00
Dhruba Borthakur	e45c7a8444	Abilty to support upto a million .sst files in the database Summary: There was an artifical limit of 50K files per database. This is insifficient if the database is 1 TB in size and each file is 2 MB. Test Plan: make check Reviewers: sheki, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D8919	2013-02-26 16:27:51 -08:00
Abhishek Kona	a9866b721b	Refactor statistics. Remove individual functions like incNumFileOpens Summary: Use only the counter mechanism. Do away with incNumFileOpens, incNumFileClose, incNumFileErrors s/NULL/nullptr/g in db/table_cache.cc Test Plan: make clean check Reviewers: dhruba, heyongqiang, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8841	2013-02-25 13:58:34 -08:00
Abhishek Kona	959337ed5b	Measure compaction time. Summary: just record time consumed in compaction Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8781	2013-02-22 11:38:40 -08:00
Abhishek Kona	ec77366e14	Counters for bytes written and read. Summary: * Counters for bytes read and write. as a part of this diff, I want to=> * Measure compaction times. @dhruba can you point which function, should * I time to get Compaction-times. Was looking at CompactRange. Test Plan: db_test Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D8763	2013-02-21 16:06:32 -08:00
Vamsi Ponnekanti	6abb30d4d0	[Missed adding cmdline parsing for new flags added in D8685] Summary: I had added FLAGS_numdistinct and FLAGS_deletepercent for randomwithverify but forgot to add cmdline parsing for those flags. Test Plan: [nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --numdistinct=500 LevelDB: version 1.5 Date: Thu Feb 21 10:34:40 2013 CPU: 24 * Intel(R) Xeon(R) CPU X5650 @ 2.67GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Created bg thread 0x7fbf90bff700 randomwithverify : 4.693 micros/op 213098 ops/sec; ( get:900000 put:80000 del:20000 total:1000000 found:714556) [nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --deletepercent=5 LevelDB: version 1.5 Date: Thu Feb 21 10:35:03 2013 CPU: 24 * Intel(R) Xeon(R) CPU X5650 @ 2.67GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Created bg thread 0x7fe14dfff700 randomwithverify : 4.883 micros/op 204798 ops/sec; ( get:900000 put:50000 del:50000 total:1000000 found:443847) [nponnekanti@dev902 /data/users/nponnekanti/rocksdb] [nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --deletepercent=5 --numdistinct=500 LevelDB: version 1.5 Date: Thu Feb 21 10:36:18 2013 CPU: 24 * Intel(R) Xeon(R) CPU X5650 @ 2.67GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Created bg thread 0x7fc31c7ff700 randomwithverify : 4.920 micros/op 203233 ops/sec; ( get:900000 put:50000 del:50000 total:1000000 found:445522) Revert Plan: OK Task ID: # Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8769	2013-02-21 12:26:32 -08:00
Vamsi Ponnekanti	945d2b59b9	[Add randomwithverify benchmark option] Summary: Added RandomWithVerify benchmark option. Test Plan: This whole diff is to test. [nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify LevelDB: version 1.5 Date: Tue Feb 19 17:50:28 2013 CPU: 24 * Intel(R) Xeon(R) CPU X5650 @ 2.67GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Created bg thread 0x7fa9c3fff700 randomwithverify : 5.004 micros/op 199836 ops/sec; ( get:900000 put:80000 del:20000 total:1000000 found:711992) Revert Plan: OK Task ID: # Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8685	2013-02-21 10:27:02 -08:00
amayank	b2c50f1c3f	Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value Summary: Changed the Get and Scan options with openForReadOnly mode to have access to the memtable. Changed the visibility of NewInternalIterator in db_impl from private to protected so that the derived class db_impl_read_only can call that in its NewIterator function for the scan case. The previous approach which changed the default for flush_on_destroy_ from false to true caused many problems in the unit tests due to empty sst files that it created. All unit tests pass now. Test Plan: make clean; make all check; ldb put and get and scans Reviewers: dhruba, heyongqiang, sheki Reviewed By: dhruba CC: kosievdmerwe, zshao, dilipj, kailiu Differential Revision: https://reviews.facebook.net/D8697	2013-02-20 10:45:52 -08:00
Abhishek Kona	fe10200ddc	Introduce histogram in statistics.h Summary: * Introduce is histogram in statistics.h * stop watch to measure time. * introduce two timers as a poc. Replaced NULL with nullptr to fight some lint errors Should be useful for google. Test Plan: ran db_bench and check stats. make all check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8637	2013-02-20 10:43:32 -08:00
amayank	f3901e0647	Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value" This reverts commit `4c696ed001`.	2013-02-18 22:32:27 -08:00
Dhruba Borthakur	fd367e677e	Fix unit test failure in db_filename.cc Summary: c_test: db/filename.cc:74: std::string leveldb::DescriptorFileName(const string&,.... Test Plan: this is a failure in a unit test Differential Revision: https://reviews.facebook.net/D8667	2013-02-18 21:53:56 -08:00
Dhruba Borthakur	4564915446	Zero out redundant sequence numbers for kvs to increase compression efficiency Summary: The sequence numbers in each record eat up plenty of space on storage. The optimization zeroes out sequence numbers on kvs in the Lmax layer that are earlier than the earliest snapshot. Test Plan: Unit test attached. Differential Revision: https://reviews.facebook.net/D8619	2013-02-18 21:51:15 -08:00
amayank	4c696ed001	Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value Summary: flush_on_destroy has a default value of false and the memtable is flushed in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that the destructor is called(db is closed) and the cases where we work with the memtable only are very less it is a good idea to give this a default value of true. Thus the put from ldb wil have its data flushed to disk in the destructor and the next Get will be able to read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when the db was opened in the normal Open mode is that the Get from normal Open first reads the memtable and directly finds the latest value written there and the Get from OpenForReadOnly doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled Test Plan: make all; ldb put and get and scans Reviewers: dhruba, heyongqiang, sheki Reviewed By: heyongqiang CC: kosievdmerwe, zshao, dilipj, kailiu Differential Revision: https://reviews.facebook.net/D8631	2013-02-15 16:56:06 -08:00
Kai Liu	b63aafce42	Allow the logs to be purged by TTL. Summary: * Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the development of such classes. * Refactor the existing AutoSplitLogger and fix several bugs. Test Plan: * Added a unit tests for different types of "auto splitable" loggers individually. * Tested the composited logger which allows the log files to be splitted by both TTL and log size. Reviewers: heyongqiang, dhruba Reviewed By: heyongqiang CC: zshao, leveldb Differential Revision: https://reviews.facebook.net/D8037	2013-02-04 19:42:40 -08:00
Chip Turner	0b83a83191	Fix poor error on num_levels mismatch and few other minor improvements Summary: Previously, if you opened a db with num_levels set lower than the database, you received the unhelpful message "Corruption: VersionEdit: new-file entry." Now you get a more verbose message describing the issue. Also, fix handling of compression_levels (both the run-over-the-end issue and the memory management of it). Lastly, unique_ptr'ify a couple of minor calls. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8151	2013-01-25 15:37:26 -08:00
Chip Turner	772f75b3fb	Stop continually re-creating build_version.c Summary: We continually rebuilt build_version.c because we put the current date into it, but that's what __DATE__ already is. This makes builds faster. This also fixes an issue with 'make clean FOO' not working properly. Also tweak the build rules to be more consistent, always have warnings, and add a 'make release' rule to handle flags for release builds. Test Plan: make, make clean Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D8139	2013-01-24 17:51:39 -08:00
Chip Turner	3dafdfb2c4	Use fallocate to prevent excessive allocation of sst files and logs Summary: On some filesystems, pre-allocation can be a considerable amount of space. xfs in our production environment pre-allocates by 1GB, for instance. By using fallocate to inform the kernel of our expected file sizes, we eliminate this wasteage (that isn't recovered until the file is closed which, in the case of LOG files, can be a considerable amount of time). Test Plan: created an xfs loopback filesystem, mounted with allocsize=4M, and ran db_stress. LOG file without this change was 4M, and with it it was 128k then grew to normal size. Reviewers: dhruba Reviewed By: dhruba CC: adsharma, leveldb Differential Revision: https://reviews.facebook.net/D7953	2013-01-24 12:25:13 -08:00
Chip Turner	2fdf91a4f8	Fix a number of object lifetime/ownership issues Summary: Replace manual memory management with std::unique_ptr in a number of places; not exhaustive, but this fixes a few leaks with file handles as well as clarifies semantics of the ownership of file handles with log classes. Test Plan: db_stress, make check Reviewers: dhruba Reviewed By: dhruba CC: zshao, leveldb, heyongqiang Differential Revision: https://reviews.facebook.net/D8043	2013-01-23 16:54:11 -08:00
Abhishek Kona	16903c35b0	Add counters to count gets and writes Summary: Add Tickers to count Write's and Get's Test Plan: make check Reviewers: dhruba, chip Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7977	2013-01-17 12:27:56 -08:00
Kosie van der Merwe	3c3df7402f	Fixed issues Valgrind found. Summary: Found issues with `db_test` and `db_stress` when running valgrind. `DBImpl` had an issue where if an compaction failed then it will use the uninitialised file size of an output file is used. This manifested as the final call to output to the log in `DoCompactionWork()` branching on uninitialized memory (all the way down in printf's innards). Test Plan: Ran `valgrind --track_origins=yes ./db_test` and `valgrind ./db_stress` to see if issues disappeared. Ran `make check` to see if there were no regressions. Reviewers: vamsi, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8001	2013-01-17 10:04:45 -08:00
Abhishek Kona	7d5a4383bb	rollover manifest file. Summary: Check in LogAndApply if the file size is more than the limit set in Options. Things to consider : will this be expensive? Test Plan: make all check. Inputs on a new unit test? Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7701	2013-01-16 12:09:44 -08:00
Chip Turner	a2dcd79c1e	Add optional clang compile mode Summary: clang is an alternate compiler based on llvm. It produces nicer error messages and finds some bugs that gcc doesn't, such as the size_t change in this file (which caused some write return values to be misinterpreted!) Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export USE_CLANG=1" then make as normal Test Plan: "make check" and "USE_CLANG=1 make check" Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7899	2013-01-15 18:48:37 -08:00
Chip Turner	9bbcab57a9	Fix broken build Summary: Mis-merged from HEAD, had a duplicate declaration. Test Plan: make -j32 OPT=-g Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7911	2013-01-15 14:05:49 -08:00
Kosie van der Merwe	28fe86c48a	Fixed bug with seek compactions on Level 0 Summary: Due to how the code handled compactions in Level 0 in `PickCompaction()` it could be the case that two compactions on level 0 ran that produced tables in level 1 that overlap. However, this case seems like it would only occur on a seek compaction which is unlikely on level 0. Furthermore, level 0 and level 1 had to have a certain arrangement of files. Test Plan: make check Reviewers: dhruba, vamsi Reviewed By: dhruba CC: leveldb, sheki Differential Revision: https://reviews.facebook.net/D7923	2013-01-15 12:43:09 -08:00
Chip Turner	c0cb289d57	Various build cleanups/improvements Summary: Specific changes: 1) Turn on -Werror so all warnings are errors 2) Fix some warnings the above now complains about 3) Add proper dependency support so changing a .h file forces a .c file to rebuild 4) Automatically use fbcode gcc on any internal machine rather than whatever system compiler is laying around 5) Fix jemalloc to once again be used in the builds (seemed like it wasn't being?) 6) Fix issue where 'git' would fail in build_detect_version because of LD_LIBRARY_PATH being set in the third-party build system Test Plan: make, make check, make clean, touch a header file, make sure rebuild is expected Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7887	2013-01-14 18:40:22 -08:00
Mark Callaghan	2ba125faf6	fix warning for unused variable Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7857	2013-01-11 15:00:47 -08:00
Abhishek Kona	85ad13be1a	Port fix for Leveldb manifest writing bug from Open-Source Summary: Pretty much a blind copy of the patch in open source. Hope to get this in before we make a release Test Plan: make clean check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7809	2013-01-10 12:06:03 -08:00
Kosie van der Merwe	d8371ef1f6	Fixing some issues Valgrind found Summary: Found some issues running Valgrind on `db_test` (there are still some outstanding ones) and fixed them. Test Plan: make check ran `valgrind ./db_test` and saw that errors no longer occur Reviewers: dhruba, vamsi, emayanke, sheki Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7803	2013-01-08 12:16:40 -08:00
Dhruba Borthakur	628dc2aad9	db_bench should use the default value for max_grandparent_overlap_factor. Summary: This was a peformance regression caused by https://reviews.facebook.net/D6729. The default value of max_grandparent_overlap_factor was erroneously set to 0 in db_bench. This was causing compactions to create really really small files because the max_grandparent_overlap_factor was erroneously set to zero in the benchmark. Test Plan: Run --benchmarks=overwrite Reviewers: heyongqiang, emayanke, sheki, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7797	2013-01-08 11:21:11 -08:00
Kosie van der Merwe	d6e873f22f	Added clearer error message for failure to create db directory in DBImpl::Recover() Summary: Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error. Fixed CreateDirIfMissing() and added Env.DirExists() Test Plan: make check to test for regessions Ran the following to test if the error message is not about lock files not existing ./db_bench --db=dir/testdb After creating a file "testdb", ran the following to see if it failed with sane error message: ./db_bench --db=testdb Reviewers: dhruba, emayanke, vamsi, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D7707	2013-01-07 10:11:18 -08:00
Mark Callaghan	4069f66cc7	Add --seed, --read_range to db_bench Summary: Adds the option --seed to db_bench to specify the base for the per-thread RNG. When not set each thread uses the same value across runs of db_bench which defeats IO stress testing. Adds the option --read_range. When set to a value > 1 an iterator is created and each query done for the randomread benchmark will do a range scan for that many rows. When not set or set to 1 the existing behavior (a point lookup) is done. Fixes a bug where a printf format string was missing. Test Plan: run db_bench Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7749	2013-01-07 09:56:10 -08:00
Kosie van der Merwe	8cd86a7be5	Fixing and adding some comments Summary: `MemTableList::Add()` neglected to mention that it took ownership of the reference held by its caller. The comment in `MemTable::Get()` was wrong in describing the format of the key. Test Plan: None Reviewers: dhruba, sheki, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7755	2013-01-03 17:13:56 -08:00
Dhruba Borthakur	d7d43ae21a	ExtendOverlappingInputs too slow for large databases. Summary: There was a bug in the ExtendOverlappingInputs method so that the terminating condition for the backward search was incorrect. Test Plan: make clean check Reviewers: sheki, emayanke, MarkCallaghan Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D7725	2013-01-02 13:19:06 -08:00
Dhruba Borthakur	f4c2b7cf97	Enhance ReadOnly mode to process the all committed transactions. Summary: Leveldb has an api OpenForReadOnly() that opens the database in readonly mode. This call had an option to not process the transaction log. This patch removes this option and always processes all transactions that had been committed. It has been done in such a way that it does not create/write to any new files in the process. The invariant of "no-writes" to the leveldb data directory is still true. This enhancement allows multiple threads to open the same database in readonly mode and access all trancations that were committed right upto the OpenForReadOnly call. I changed the public API to match the new semantics because there are no users who are currently using this api. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7479	2012-12-19 16:30:46 -08:00
Dhruba Borthakur	3d1e92b05a	Enhancements to rocksdb for better support for replication. Summary: 1. The OpenForReadOnly() call should not lock the db. This is useful so that multiple processes can open the same database concurrently for reading. 2. GetUpdatesSince should not error out if the archive directory does not exist. 3. A new constructor for WriteBatch that can takes a serialized string as a parameter of the constructor. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7449	2012-12-17 11:40:19 -08:00
Kosie van der Merwe	62d48571de	Added meta-database support. Summary: Added kMetaDatabase for meta-databases in db/filename.h along with supporting fuctions. Fixed switch in DBImpl so that it also handles kMetaDatabase. Fixed DestroyDB() that it can handle destroying meta-databases. Test Plan: make check Reviewers: sheki, emayanke, vamsi, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7245	2012-12-17 11:26:59 -08:00
Abhishek Kona	2f0585fb97	Fix a bug. Where DestroyDB deletes a non-existant archive directory. Summary: C tests would fail sometimes as DestroyDB would return a Failure Status message when deleting an archival directory which was not created (WAL_ttl_seconds = 0). Fix: Ignore the Status returned on Deleting Archival Directory. Test Plan: * make check Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7395	2012-12-17 10:25:26 -08:00
Zheng Shao	c28097538a	manifest_dump: Add --hex=1 option Summary: Without this option, manifest_dump does not print binary keys for files in a human-readable way. Test Plan: ./manifest_dump --hex=1 --verbose=0 --file=/data/users/zshao/fdb_comparison/leveldb/fbobj.apprequest-0_0_original/MANIFEST-000002 manifest_file_number 589 next_file_number 590 last_sequence 2311567 log_number 543 prev_log_number 0 --- level 0 --- version# 0 --- 532:1300357['0000455BABE20000' @ 2183973 : 1 .. 'FFFCA5D7ADE20000' @ 2184254 : 1] 536:1308170['000198C75CE30000' @ 2203313 : 1 .. 'FFFCF94A79E30000' @ 2206463 : 1] 542:1321644['0002931AA5E50000' @ 2267055 : 1 .. 'FFF77B31C5E50000' @ 2270754 : 1] 544:1286390['000410A309E60000' @ 2278592 : 1 .. 'FFFE470A73E60000' @ 2289221 : 1] 538:1298778['0006BCF4D8E30000' @ 2217050 : 1 .. 'FFFD77DAF7E30000' @ 2220489 : 1] 540:1282353['00090D5356E40000' @ 2231156 : 1 .. 'FFFFF4625CE40000' @ 2231969 : 1] --- level 1 --- version# 0 --- 510:2112325['000007F9C2D40000' @ 1782099 : 1 .. '146F5B67B8D80000' @ 1905458 : 1] 511:2121742['146F8A3023D60000' @ 1824388 : 1 .. '28BC8FBB9CD40000' @ 1777993 : 1] 512:801631['28BCD396F1DE0000' @ 2080191 : 1 .. '3082DBE9ADDB0000' @ 1989927 : 1] Reviewers: dhruba, sheki, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7425	2012-12-16 08:58:28 -08:00
Abhishek Kona	2ba866e0c5	GetSequence API in write batch. Summary: WriteBatch is now used by the GetUpdatesSinceAPI. This API is external and will be used by the rocks server. Rocks Server and others will need to know about the Sequence Number in the WriteBatch. This public method will allow for that. Test Plan: make all check. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7293	2012-12-12 22:21:10 -08:00
Abhishek Kona	22c283625b	Fix Bug in Binary Search for files containing a seq no. and delete Archived Log Files during Destroy DB. Summary: * Fixed implementation bug in Binary_Searvch introduced in https://reviews.facebook.net/D7119 * Binary search is also overflow safe. * Delete archive log files and archive dir during DestroyDB Test Plan: make check Reviewers: dhruba CC: kosievdmerwe, emayanke Differential Revision: https://reviews.facebook.net/D7263	2012-12-11 16:15:02 -08:00
Dhruba Borthakur	24fc379273	An public api to fetch the latest transaction id. Summary: Implement a interface to retrieve the most current transaction id from the database. Test Plan: Added unit test. Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7269	2012-12-10 16:04:19 -08:00
Abhishek Kona	1c6742e32f	Refactor GetArchivalDirectoryName to filename.h Summary: filename.h has functions to do similar things. Moving code away from db_impl.cc Test Plan: make check Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7251	2012-12-10 10:51:07 -08:00
Abhishek Kona	8055008909	GetUpdatesSince API to enable replication. Summary: How it works: * GetUpdatesSince takes a SequenceNumber. * A LogFile with the first SequenceNumber nearest and lesser than the requested Sequence Number is found. * Seek in the logFile till the requested SeqNumber is found. * Return an iterator which contains logic to return record's one by one. Test Plan: * Test case included to check the good code path. * Will update with more test-cases. * Feedback required on test-cases. Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7119	2012-12-07 11:42:13 -08:00
Dhruba Borthakur	c847a31727	Print compaction score for every compaction run. Summary: A compaction is picked based on its score. It is useful to print the compaction score in the LOG because it aids in debugging. If one looks at the logs, one can find out why a compaction was preferred over another. Test Plan: make clean check Differential Revision: https://reviews.facebook.net/D7137	2012-12-04 10:03:47 -08:00
sheki	d4627e6de4	Move WAL files to archive directory, instead of deleting. Summary: Create a directory "archive" in the DB directory. During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory, instead of deleting. Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move. Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6975	2012-11-28 17:28:08 -08:00
Abhishek Kona	d29f181923	Fix all the lint errors. Summary: Scripted and removed all trailing spaces and converted all tabs to spaces. Also fixed other lint errors. All lint errors from this point of time should be taken seriously. Test Plan: make all check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7059	2012-11-28 17:18:41 -08:00
Dhruba Borthakur	9a357847eb	Delete non-visible keys during a compaction even in the presense of snapshots. Summary: LevelDB should delete almost-new keys when a long-open snapshot exists. The previous behavior is to keep all versions that were created after the oldest open snapshot. This can lead to database size bloat for high-update workloads when there are long-open snapshots and long-open snapshot will be used for logical backup. By "almost new" I mean that the key was updated more than once after the oldest snapshot. If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if we find two instances of the same key k1 that lie entirely within s1 and s2 (i.e. s1 < k1 < s2), then the earlier version of k1 can be safely deleted because that version is not visible in any snapshot. Test Plan: unit test attached make clean check Differential Revision: https://reviews.facebook.net/D6999	2012-11-28 15:47:40 -08:00
Dhruba Borthakur	3366eda839	Print out status at the end of a compaction run. Summary: Print out status at the end of a compaction run. This helps in debugging. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki Differential Revision: https://reviews.facebook.net/D7035	2012-11-27 22:17:38 -08:00
sheki	43f5a07989	Remove unused varibles. Cause compiler warnings. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: emayanke Differential Revision: https://reviews.facebook.net/D6993	2012-11-26 20:55:24 -08:00
Dhruba Borthakur	2a39699900	Assertion failure while running with unit tests with OPT=-g Summary: When we expand the range of keys for a level 0 compaction, we need to invoke ParentFilesInCompaction() only once for the entire range of keys that is being compacted. We were invoking it for each file that was being compacted, but this triggers an assertion because each file's range were contiguous but non-overlapping. I renamed ParentFilesInCompaction to ParentRangeInCompaction to adequately represent that it is the range-of-keys and not individual files that we compact in a single compaction run. Here is the assertion that is fixed by this patch. db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData, std::allocator<leveldb::FileMetaData> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed. Test Plan: make clean check OPT=-g Reviewers: sheki Reviewed By: sheki CC: MarkCallaghan, emayanke, leveldb Differential Revision: https://reviews.facebook.net/D6963	2012-11-26 14:00:39 -08:00
Dhruba Borthakur	e0cd6bf0e9	The c_test was sometimes failing with an assertion. Summary: On fast filesystems (e.g. /dev/shm and ext4), the flushing of memstore to disk was fast and quick, and the background compaction thread was not getting scheduled fast enough to delete obsolete files before the db was closed. This caused the repair method to pick up those files that were not part of the db and the unit test was failing. The fix is to enhance the unti test to run a compaction before closing the database so that all files that are not part of the database are truly deleted from the filesystem. Test Plan: make c_test; ./c_test Reviewers: chip, emayanke, sheki Reviewed By: chip CC: leveldb Differential Revision: https://reviews.facebook.net/D6915	2012-11-26 11:59:51 -08:00
Dhruba Borthakur	7632fdb5cb	Support taking a configurable number of files from the same level to compact in a single compaction run. Summary: The compaction process takes some files from LevelK and merges it into LevelK+1. The number of files it picks from LevelK was capped such a way that the total amount of data picked does not exceed the maxfilesize of that level. This essentially meant that only one file from LevelK is picked for a single compaction. For bulkloads, we would like to take many many file from LevelK and compact them using a single compaction run. This patch introduces a option called the 'source_compaction_factor' (similar to expanded_compaction_factor). It is a multiplier that is multiplied by the maxfilesize of that level to arrive at the limit that is used to throttle the number of source files from LevelK. For bulk loads, set source_compaction_factor to a very high number so that multiple files from the same level are picked for compaction in a single compaction. The default value of source_compaction_factor is 1, so that we can keep backward compatibilty with existing compaction semantics. Test Plan: make clean check Reviewers: emayanke, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D6867	2012-11-21 08:37:03 -08:00
Dhruba Borthakur	fbb73a4ac3	Support to disable background compactions on a database. Summary: This option is needed for fast bulk uploads. The goal is to load all the data into files in L0 without any interference from background compactions. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6849	2012-11-20 21:12:06 -08:00
Dhruba Borthakur	3754f2f4ff	A major bug that was not considering the compaction score of the n-1 level. Summary: The method Finalize() recomputes the compaction score of each level and then sorts these score from largest to smallest. The idea is that the level with the largest compaction score will be a better candidate for compaction. There are usually very few levels, and a bubble sort code was used to sort these compaction scores. There existed a bug in the sorting code that skipped looking at the score for the n-1 level. This meant that even if the compaction score of the n-1 level is large, it will not be picked for compaction. This patch fixes the bug and also introduces "asserts" in the code to detect any possible inconsistencies caused by future bugs. This bug existed in the very first code change that introduced multi-threaded compaction to the leveldb code. That version of code was committed on Oct 19th via `1ca0584345` Test Plan: make clean check OPT=-g Reviewers: emayanke, sheki, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6837	2012-11-20 15:44:21 -08:00
Dhruba Borthakur	dde70898a1	Fix asserts Summary: make check OPT=-g fails with the following assert. ==== Test DBTest.ApproximateSizes db_test: db/version_set.cc:765: void leveldb::VersionSet::Builder::CheckConsistencyForDeletes(leveldb::VersionEdit, int, int): Assertion `found' failed. The assertion was that file #7 that was being deleted did not preexists, but actualy it did pre-exist as shown in the manifest dump shows below. The bug was that we did not check for file existance at the same level. **********************Edit[0] = VersionEdit { Comparator: leveldb.BytewiseComparator } *********************Edit[1] = VersionEdit { LogNumber: 8 PrevLogNumber: 0 NextFile: 9 LastSeq: 80 AddFile: 0 7 8005319 'key000000' @ 1 : 1 .. 'key000079' @ 80 : 1 } ***********************Edit[2] = VersionEdit { LogNumber: 8 PrevLogNumber: 0 NextFile: 13 LastSeq: 80 CompactPointer: 0 'key000079' @ 80 : 1 DeleteFile: 0 7 AddFile: 1 9 2101425 'key000000' @ 1 : 1 .. 'key000020' @ 21 : 1 AddFile: 1 10 2101425 'key000021' @ 22 : 1 .. 'key000041' @ 42 : 1 AddFile: 1 11 2101425 'key000042' @ 43 : 1 .. 'key000062' @ 63 : 1 AddFile: 1 12 1701165 'key000063' @ 64 : 1 .. 'key000079' @ 80 : 1 } Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2012-11-19 14:51:22 -08:00
Dhruba Borthakur	a4b79b6e28	Merge branch 'master' into performance	2012-11-19 13:20:25 -08:00
Dhruba Borthakur	74054fa993	Fix compilation error while compiling unit tests with OPT=-g Summary: Fix compilation error while compiling with OPT=-g Test Plan: make clean check OPT=-g Reviewers: CC: Task ID: # Blame Rev:	2012-11-19 13:16:46 -08:00
Dhruba Borthakur	48dafb2c59	Fix compilation error introduced by previous commit `7889e09455` Summary: Fix compilation error introduced by previous commit `7889e09455` Test Plan: make clean check	2012-11-19 12:16:45 -08:00
Dhruba Borthakur	7889e09455	Enhance manifest_dump to print each individual edit. Summary: The manifest file contains a series of edits. If the verbose option is switched on, then print each individual edit in the manifest file. This helps in debugging. Test Plan: make clean manifest_dump Reviewers: emayanke, sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6807	2012-11-19 12:04:35 -08:00
amayank	65b035a47f	Fix a coding error in db_test.cc Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not Test Plan: make all;./db_test Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: sheki Differential Revision: https://reviews.facebook.net/D6771	2012-11-19 12:04:35 -08:00
Dhruba Borthakur	4b622ab0f2	Enhance manifest_dump to print each individual edit. Summary: The manifest file contains a series of edits. If the verbose option is switched on, then print each individual edit in the manifest file. This helps in debugging. Test Plan: make clean manifest_dump Reviewers: emayanke, sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6807	2012-11-19 12:02:27 -08:00
Dhruba Borthakur	62e7583f94	enhance dbstress to simulate hard crash Summary: dbstress has an option to reopen the database. Make it such that the previous handle is not closed before we reopen, this simulates a situation similar to a process crash. Added new api to DMImpl to remove the lock file. Test Plan: run db_stress Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D6777	2012-11-18 23:16:17 -08:00
amayank	de278a6de9	Fix a coding error in db_test.cc Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not Test Plan: make all;./db_test Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: sheki Differential Revision: https://reviews.facebook.net/D6771	2012-11-16 14:56:50 -08:00
Dhruba Borthakur	6c5a4d646a	Merge branch 'master' into performance Conflicts: db/db_impl.h	2012-11-14 21:39:52 -08:00
Dhruba Borthakur	e988c11f58	Enhance db_bench to be able to specify a grandparent_overlap_factor. Summary: The value specified in max_grandparent_overlap_factor is used to limit the file size in a compaction run. This patch makes it configurable when using db_bench. Test Plan: make clean db_bench Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D6729	2012-11-14 16:20:13 -08:00
Dhruba Borthakur	5d16e503a6	Improved CompactionFilter api: pass in a opaque argument to CompactionFilter invocation. Summary: There are applications that operate on multiple leveldb instances. These applications will like to pass in an opaque type for each leveldb instance and this type should be passed back to the application with every invocation of the CompactionFilter api. Test Plan: Enehanced unit test for opaque parameter to CompactionFilter. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, sheki, emayanke Differential Revision: https://reviews.facebook.net/D6711	2012-11-13 16:22:26 -08:00
Dhruba Borthakur	43d9a8225a	Fix asserts so that "make check OPT=-g" works on performance branch Summary: Compilation used to fail with the error: db/version_set.cc:1773: error: ‘number_of_files_to_sort_’ is not a member of ‘leveldb::VersionSet’ I created a new method called CheckConsistencyForDeletes() so that all the high cost checking is done only when OPT=-g is specified. I also fixed a bug in PickCompactionBySize that was triggered when OPT=-g was switched on. The base_index in the compaction record was not set correctly. Test Plan: make check OPT=-g Differential Revision: https://reviews.facebook.net/D6687	2012-11-13 10:40:52 -08:00
Dhruba Borthakur	a785e029f7	The db_bench utility was broken in 1.5.4.fb because of a signed-unsigned comparision. Summary: The db_bench utility was broken in 1.5.4.fb because of a signed-unsigned comparision. The static variable FLAGS_min_level_to_compress was recently changed from int to 'unsigned in' but it is initilized to a nagative value -1. The segfault is of this type: Program received signal SIGSEGV, Segmentation fault. Open (this=0x7fffffffdee0) at db/db_bench.cc:939 939 db/db_bench.cc: No such file or directory. (gdb) where Test Plan: run db_bench with no options. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6663	2012-11-12 13:59:35 -08:00
Dhruba Borthakur	9c6c232e47	Compilation error while compiling with OPT=-g Summary: make clean check OPT=-g fails leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’: ./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope util/ldb_cmd.cc:255: warning: left shift count >= width of type Test Plan: make clean check OPT=-g Reviewers: CC: Task ID: # Blame Rev:	2012-11-11 00:20:40 -08:00
Abhishek Kona	0f8e4721a5	Metrics: record compaction drop's and bloom filter effectiveness Summary: Record BloomFliter hits and drop off reasons during compaction. Test Plan: Unit tests work. Reviewers: dhruba, heyongqiang Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6591	2012-11-09 11:38:45 -08:00
heyongqiang	20d18a89a3	disable size compaction in ldb reduce_levels and added compression and file size parameter to it Summary: disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction, added --compression=none\|snappy\|zlib\|bzip2 and --file_size= per-file size to ldb reduce_levels command Test Plan: run ldb Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba CC: sheki, emayanke Differential Revision: https://reviews.facebook.net/D6597	2012-11-09 10:14:47 -08:00
Abhishek Kona	391885c4e4	stat's collection in leveldb Summary: Prototype stat's collection. Diff is a good estimate of what the final code will look like. A few assumptions : * Used a global static instance of the statistics object. Plan to pass it to each internal function. Static allows metrics only at app level. * In the Ticker's do not do any locking. Depend on the mutex at each function of LevelDB. If we ever remove the mutex, we should change here too. The other option is use atomic objects anyways as there won't be any contention as they will be always acquired only by one thread. * The counters are dumb, increment through lifecycle. Plan to use ods etc to get last5min stat etc. Test Plan: made changes in db_bench Ran ./db_bench --statistics=1 --num=10000 --cache_size=5000 This will print the cache hit/miss stats. Reviewers: dhruba, heyongqiang Differential Revision: https://reviews.facebook.net/D6441	2012-11-08 13:55:49 -08:00
Dhruba Borthakur	95dda37858	Move filesize-based-sorting to outside the Mutex Summary: When a new version is created, we sort all the files at every level based on their size. This is necessary because we want to compact the largest file first. The sorting takes quite a bit of CPU. Moved the sorting code to be outside the mutex. Also, the earlier code was sorting files at all levels but we do not need to sort the highest-number level because those files are never the cause of any compaction. To reduce sorting costs, we sort only the first few files in each level because it is likely that those are the only files in that level that will be picked for compaction. At steady state, I have seen that this patch increase throughout from 1500 writes/sec to 1700 writes/sec at the end of a 72 hour run. The cpu saving by not sorting the last level was not distinctive in this test run because there were only 100K files in the highest numbered level. I expect the cpu saving to be significant when the number of files is much higher. This is mostly an early preview and not ready for rigorous review. With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs. Test Plan: make check Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D6411	2012-11-07 15:39:44 -08:00
Dhruba Borthakur	18cb6004d2	Fixed compilation error in previous merge. Summary: Fixed compilation error in previous merge. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2012-11-07 15:24:47 -08:00
Dhruba Borthakur	8143062edd	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/version_set.cc util/options.cc	2012-11-07 15:11:37 -08:00
heyongqiang	3fcf533ed0	Add a readonly db Summary: as subject Test Plan: run db_bench readrandom Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6495	2012-11-07 14:19:48 -08:00
Dhruba Borthakur	9b87a2bae8	Avoid doing a exhaustive search when looking for overlapping files. Summary: The Version::GetOverlappingInputs() is called multiple times in the compaction code path. Eack invocation does a binary search for overlapping files in the specified key range. This patch remembers the offset of an overlapped file when GetOverlappingInputs() is called the first time within a compaction run. Suceeding calls to GetOverlappingInputs() uses the remembered index to avoid the binary search. I measured that 1000 iterations of GetOverlappingInputs takes around 4500 microseconds without this patch. If I use this patch with the hint on every invocation, then 1000 iterations take about 3900 microsecond. Test Plan: make check OPT=-g Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6513	2012-11-07 11:47:17 -08:00
Abhishek Kona	4e413df3d0	Flush Data at object destruction if disableWal is used. Summary: Added a conditional flush in ~DBImpl to flush. There is still a chance of writes not being persisted if there is a crash (not a clean shutdown) before the DBImpl instance is destroyed. Test Plan: modified db_test to meet the new expectations. Reviewers: dhruba, heyongqiang Differential Revision: https://reviews.facebook.net/D6519	2012-11-06 15:04:42 -08:00
Dhruba Borthakur	aa42c66814	Fix all warnings generated by -Wall option to the compiler. Summary: The default compilation process now uses "-Wall" to compile. Fix all compilation error generated by gcc. Test Plan: make all check Reviewers: heyongqiang, emayanke, sheki Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6525	2012-11-06 14:07:31 -08:00
Dhruba Borthakur	5f91868cee	Merge branch 'master' into performance Conflicts: db/version_set.cc util/options.cc	2012-11-05 16:51:55 -08:00
Dhruba Borthakur	cb7a00227f	The method GetOverlappingInputs should use binary search. Summary: The method Version::GetOverlappingInputs used a sequential search to map a kay-range to a set of files. But the files are arranged in ascending order of key, so a biary search is more effective. This patch implements Version::GetOverlappingInputsBinarySearch that finds one file that corresponds to the specified key range and then iterates backwards and forwards to find all overlapping files. This patch is critical for making compactions efficient, especially when there are thousands of files in a single level. I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes takes 16000 microseconds without this patch. With this patch, the same method takes about 4600 microseconds. Test Plan: Almost all unit tests in db_test uses this method to lookup keys. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6465	2012-11-05 16:08:01 -08:00
Dhruba Borthakur	5273c81483	Ability to invoke application hook for every key during compaction. Summary: There are certain use-cases where the application intends to delete older keys aftre they have expired a certian time period. One option for those applications is to periodically scan the entire database and delete appropriate keys. A better way is to allow the application to hook into the compaction process. This patch allows the application to set a method callback for every key that is being compacted. If this method returns true, then the key is not preserved in the output of the compaction. Test Plan: This is mostly to preview the proposed new public api. Since it is a public api, please do due diligence on reviewing it. I will be writing test cases for this api in mynext version of this patch. Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang CC: sheki, adsharma Differential Revision: https://reviews.facebook.net/D6285	2012-11-05 16:02:13 -08:00
heyongqiang	f1a7c735b5	fix complie error Summary: as subject Test Plan:n/a	2012-11-05 10:30:19 -08:00
heyongqiang	d55c2ba305	Add a tool to change number of levels Summary: as subject. Test Plan: manually test it, will add a testcase Reviewers: dhruba, MarkCallaghan Differential Revision: https://reviews.facebook.net/D6345	2012-11-05 10:17:39 -08:00
Dhruba Borthakur	81f735d97c	Merge branch 'master' into performance Conflicts: db/db_impl.cc util/options.cc	2012-11-05 09:41:38 -08:00
Dhruba Borthakur	a1bd5b7752	Compilation problem introduced by previous commit `854c66b089`. Summary: Compilation problem introduced by previous commit `854c66b089`. Test Plan: make check	2012-11-04 22:04:14 -08:00
amayank	854c66b089	Make compression options configurable. These include window-bits, level and strategy for ZlibCompression Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105 Test Plan: make all check Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang Differential Revision: https://reviews.facebook.net/D6393	2012-11-02 11:26:39 -07:00
heyongqiang	3096fa7534	Add two more options: disable block cache and make table cache shard number configuable Summary: as subject Test Plan: run db_bench and db_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6111	2012-11-01 13:23:21 -07:00
Mark Callaghan	3e7e269292	Use timer to measure sleep rather than assume it is 1000 usecs Summary: This makes the stall timers in MakeRoomForWrite more accurate by timing the sleeps. From looking at the logs the real sleep times are usually about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are: 2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger 2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83 Task ID: # Blame Rev: Test Plan: run db_bench, look at DB/LOG Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6297	2012-10-30 07:21:37 -07:00
heyongqiang	fb8d437325	fix test failure Summary: as subject Test Plan: db_test Reviewers: dhruba, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6309	2012-10-29 18:55:52 -07:00
heyongqiang	925f60d39d	add a test case to make sure chaning num_levels will fail Summary: Summary: as subject Test Plan: db_test Reviewers: dhruba, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6303	2012-10-29 15:27:07 -07:00
Dhruba Borthakur	53e04311b1	Merge branch 'master' into performance Conflicts: db/db_bench.cc util/options.cc	2012-10-29 14:18:00 -07:00
Dhruba Borthakur	321dfdc3ae	Allow having different compression algorithms on different levels. Summary: The leveldb API is enhanced to support different compression algorithms at different levels. This adds the option min_level_to_compress to db_bench that specifies the minimum level for which compression should be done when compression is enabled. This can be used to disable compression for levels 0 and 1 which are likely to suffer from stalls because of the CPU load for memtable flushes and (L0,L1) compaction. Level 0 is special as it gets frequent memtable flushes. Level 1 is special as it frequently gets all:all file compactions between it and level 0. But all other levels could be the same. For any level N where N > 1, the rate of sequential IO for that level should be the same. The last level is the exception because it might not be full and because files from it are not read to compact with the next larger level. The same amount of time will be spent doing compaction at any level N excluding N=0, 1 or the last level. By this standard all of those levels should use the same compression. The difference is that the loss (using more disk space) from a faster compression algorithm is less significant for N=2 than for N=3. So we might be willing to trade disk space for faster write rates with no compression for L0 and L1, snappy for L2, zlib for L3. Using a faster compression algorithm for the mid levels also allows us to reclaim some cpu without trading off much loss in disk space overhead. Also note that little is to be gained by compressing levels 0 and 1. For a 4-level tree they account for 10% of the data. For a 5-level tree they account for 1% of the data. With compression enabled: * memtable flush rate is ~18MB/second * (L0,L1) compaction rate is ~30MB/second With compression enabled but min_level_to_compress=2 * memtable flush rate is ~320MB/second * (L0,L1) compaction rate is ~560MB/second This practicaly takes the same code from https://reviews.facebook.net/D6225 but makes the leveldb api more general purpose with a few additional lines of code. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6261	2012-10-29 11:48:09 -07:00
Mark Callaghan	acc8567b24	Add more rates to db_bench output Summary: Adds the "MB/sec in" and "MB/sec out" to this line: Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out Changes all values to be reported per interval and since test start for this line: ... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6291	2012-10-29 11:30:07 -07:00
Dhruba Borthakur	de7689b1d7	Fix unit test failure caused by delaying deleting obsolete files. Summary: A previous commit `4c107587ed` introduced the idea that some version updates might not delete obsolete files. This means that if a unit test blindly counts the number of files in the db directory it might not represent the true state of the database. Use GetLiveFiles() insteads to count the number of live files in the database. Test Plan: make check	2012-10-29 11:12:24 -07:00
Mark Callaghan	70c42bf05f	Adds DB::GetNextCompaction and then uses that for rate limiting db_bench Summary: Adds a method that returns the score for the next level that most needs compaction. That method is then used by db_bench to rate limit threads. Threads are put to sleep at the end of each stats interval until the score is less than the limit. The limit is set via the --rate_limit=$double option. The specified value must be > 1.0. Also adds the option --stats_per_interval to enable additional metrics reported every stats interval. Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6243	2012-10-29 10:17:43 -07:00
Kai Liu	d50f8eb603	Enable LevelDb to create a new log file if current log file is too large. Summary: Enable LevelDb to create a new log file if current log file is too large. Test Plan: Write a script and manually check the generated info LOG. Task ID: 1803577 Blame Rev: Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: zshao Differential Revision: https://reviews.facebook.net/D6003	2012-10-26 14:55:02 -07:00
Mark Callaghan	65855dd8d4	Normalize compaction stats by time in compaction Summary: I used server uptime to compute per-level IO throughput rates. I intended to use time spent doing compaction at that level. This fixes that. Task ID: # Blame Rev: Test Plan: run db_bench, look at results Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6237	2012-10-26 14:19:13 -07:00
Dhruba Borthakur	ea9e087851	Merge branch 'master' into performance Conflicts: db/db_bench.cc db/db_impl.cc db/db_test.cc	2012-10-26 08:57:56 -07:00
Dhruba Borthakur	8eedf13a82	Fix unit test failure caused by delaying deleting obsolete files. Summary: A previous commit `4c107587ed` introduced the idea that some version updates might not delete obsolete files. This means that if a unit test blindly counts the number of files in the db directory it might not represent the true state of the database. Use GetLiveFiles() insteads to count the number of live files in the database. Test Plan: make check Reviewers: heyongqiang, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6207	2012-10-26 08:42:05 -07:00
Dhruba Borthakur	5b0fe6c73b	Greedy algorithm for picking files to compact. Summary: It is best if we pick the largest file to compact in a level. This reduces the write amplification factor for compactions. Each level has an auxiliary data structure called files_by_size_ that sorts all files by their size. This data structure is updated when a new version is created. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6195	2012-10-25 18:27:53 -07:00
Dhruba Borthakur	8fb5f40468	firstIndex fix for multi-threaded compaction code. Summary: Prior to multi-threaded compaction, wrap-around would be done by using current_->files_[level[0]. With this change we should be using the first file for which f->being_compacted is not true. `1ca0584345 (commitcomment-2041516)` Test Plan: make check Differential Revision: https://reviews.facebook.net/D6165	2012-10-25 08:44:47 -07:00
Mark Callaghan	e7206f43ee	Improve statistics Summary: This adds more statistics to be reported by GetProperty("leveldb.stats"). The new stats include time spent waiting on stalls in MakeRoomForWrite. This also includes the total amplification rate where that is: (#bytes of sequential IO during compaction) / (#bytes from Put) This also includes a lot more data for the per-level compaction report. * Rn(MB) - MB read from level N during compaction between levels N and N+1 * Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1 * Wnew(MB) - new data written to the level during compaction * Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB) * Rn - files read from level N during compaction between levels N and N+1 * Rnp1 - files read from level N+1 during compaction between levels N and N+1 * Wnp1 - files written to level N+1 during compaction between levels N and N+1 * NewW - new files written to level N+1 during compaction * Count - number of compactions done for this level This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB) Compactions Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count ------------------------------------------------------------------------------------------------------------------------------------- 0 3 6 33 0 576 0 0 576 -1.0 0.0 1.3 0 0 0 0 290 1 127 242 351 5316 5314 570 4747 567 17.0 12.1 12.1 287 2399 2685 286 32 2 161 328 54 822 824 326 496 328 4.0 1.9 1.9 160 251 411 160 161 Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out Uptime(secs): 439.8 Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - (cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8) Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D6153	2012-10-24 14:21:38 -07:00
Dhruba Borthakur	3b06f94fa2	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_impl.h db/version_set.cc	2012-10-23 22:30:07 -07:00
Dhruba Borthakur	4c107587ed	Delete files outside the mutex. Summary: The compaction process deletes a large number of files. This takes quite a bit of time and is best done outside the mutex lock. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6123	2012-10-22 11:53:23 -07:00
heyongqiang	5010daa7a8	add "seek_compaction" to log for better debug Summary: Summary: as subject Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6117	2012-10-22 10:00:25 -07:00
Dhruba Borthakur	3489cd615c	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_impl.h	2012-10-21 02:15:19 -07:00
Dhruba Borthakur	f95219fb32	Delete files outside the mutex. Summary: The compaction process deletes a large number of files. This takes quite a bit of time and is best done outside the mutex lock. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6123	2012-10-21 02:03:00 -07:00
Dhruba Borthakur	98f23cf04a	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_impl.h	2012-10-21 01:55:19 -07:00
Dhruba Borthakur	64c4b9f0e2	Delete files outside the mutex. Summary: The compaction process deletes a large number of files. This takes quite a bit of time and is best done outside the mutex lock. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2012-10-21 01:49:48 -07:00
Dhruba Borthakur	e982f5a1d2	Merge branch 'master' into performance Conflicts: util/options.cc	2012-10-19 15:16:42 -07:00
Dhruba Borthakur	cf5adc8016	db_bench was not correctly initializing the value for delete_obsolete_files_period_micros option. Summary: The parameter delete_obsolete_files_period_micros controls the periodicity of deleting obsolete files. db_bench was reading in this parameter intoa local variable called 'l' but was incorrectly using another local variable called 'n' while setting it in the db.options data structure. This patch also logs the value of delete_obsolete_files_period_micros in the LOG file at db startup time. I am hoping that this will improve the overall write throughput drastically. Test Plan: run db_bench Reviewers: MarkCallaghan, heyongqiang Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6099	2012-10-19 15:10:12 -07:00
Dhruba Borthakur	1ca0584345	This is the mega-patch multi-threaded compaction published in https://reviews.facebook.net/D5997. Summary: This patch allows compaction to occur in multiple background threads concurrently. If a manual compaction is issued, the system falls back to a single-compaction-thread model. This is done to ensure correctess and simplicity of code. When the manual compaction is finished, the system resumes its concurrent-compaction mode automatically. The updates to the manifest are done via group-commit approach. Test Plan: run db_bench	2012-10-19 14:00:53 -07:00
Dhruba Borthakur	aa73538f2a	The deletion of obsolete files should not occur very frequently. Summary: The method DeleteObsolete files is a very costly methind, especially when the number of files in a system is large. It makes a list of all live-files and then scans the directory to compute the diff. By default, this method is executed after every compaction run. This patch makes it such that DeleteObsolete files is never invoked twice within a configured period. Test Plan: run all unit tests Reviewers: heyongqiang, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6045	2012-10-16 10:26:10 -07:00
Dhruba Borthakur	0230866791	Enhance db_bench to allow setting the number of levels in a database. Summary: Enhance db_bench to allow setting the number of levels in a database. Test Plan: run db_bench and look at LOG Reviewers: heyongqiang, MarkCallaghan Reviewed By: MarkCallaghan CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6027	2012-10-15 10:18:49 -07:00
Dhruba Borthakur	c1006d4276	An configurable option to write data using write instead of mmap. Summary: We have seen that reading data via the pread call (instead of mmap) is much faster on Linux 2.6.x kernels. This patch makes an equivalent option to switch off mmaps for the write path as well. db_bench --mmap_write=0 will use write() instead of mmap() to write data to a file. This change is backward compatible, the default option is to continue using mmap for writing to a file. Test Plan: "make check all" Differential Revision: https://reviews.facebook.net/D5781	2012-10-03 17:08:13 -07:00
Mark Callaghan	e678a5947a	Add --stats_interval option to db_bench Summary: The option is zero by default and in that case reporting is unchanged. By unchanged, the interval at which stats are reported is scaled after each report and newline is not issued after each report so one line is rewritten. When non-zero it specifies the constant interval (in operations) at which statistics are reported and the stats include the rate per interval. This makes it easier to determine whether QPS changes over the duration of the test. Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba CC: heyongqiang Differential Revision: https://reviews.facebook.net/D5817	2012-10-03 09:54:33 -07:00
Mark Callaghan	d8763abecd	Fix the bounds check for the --readwritepercent option Summary: see above Task ID: # Blame Rev: Test Plan: run db_bench with invalid value for option Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba CC: heyongqiang Differential Revision: https://reviews.facebook.net/D5823	2012-10-03 09:52:26 -07:00
Mark Callaghan	98804f914f	Fix compiler warnings and errors in ldb.c Summary: stdlib.h is needed for exit() --readhead --> --readahead Task ID: # Blame Rev: Test Plan: compile Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - fix compiler warnings & errors Reviewers: dhruba Reviewed By: dhruba CC: heyongqiang Differential Revision: https://reviews.facebook.net/D5805	2012-10-03 06:46:59 -07:00
Abhishek Kona	fec81318b0	Commandline tool to compace LevelDB databases. Summary: A simple CLI which calles DB->CompactRange() Can take String key's as range. Test Plan: Inserted data into a table. Waited for a minute, used compact tool on it. File modification time's changed so Compact did something on the files. Existing unit tests work. Reviewers: heyongqiang, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D5697	2012-10-01 10:49:19 -07:00
Dhruba Borthakur	c1bb32e1ba	Trigger read compaction only if seeks to storage are incurred. Summary: In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, th current code counts these files before the bloom filter check is applied. This patch counts a 'seek' only if the file fails the bloom filter check and has to read in data block(s) from the storage. This patch also counts a 'seek' if a file is not present in the file-cache, because opening a file means that its index blocks need to be read into cache. Test Plan: unit test attached. I will probably add one more unti tests. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5709	2012-09-28 11:10:52 -07:00
Dhruba Borthakur	24eea931ef	If ReadCompaction is switched off, then it is better to not even submit background compaction jobs. Summary: If ReadCompaction is switched off, then it is better to not even submit background compaction jobs. I see about 3% increase in read-throughput on a pure memory database. Test Plan: run db_bench Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5673	2012-09-25 11:07:01 -07:00
Dhruba Borthakur	ae36e509f8	The BackupAPI should also list the length of the manifest file. Summary: The GetLiveFiles() api lists the set of sst files and the current MANIFEST file. But the database continues to append new data to the MANIFEST file even when the application is backing it up to the backup location. This means that the database-version that is stored in the MANIFEST FILE in the backup location does not correspond to the sst files returned by GetLiveFiles. This API adds a new parameter to GetLiveFiles. This new parmeter returns the current size of the MANIFEST file. Test Plan: Unit test attached. Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5631	2012-09-25 03:13:25 -07:00
Dhruba Borthakur	bb2dcd2457	Segfault in DoCompactionWork caused by buffer overflow Summary: The code was allocating 200 bytes on the stack but it writes 256 bytes into the array. x8a8ea5 std::_Rb_tree<>::erase() @ 0x7f134bee7eb0 (unknown) @ 0x8a8ea5 std::_Rb_tree<>::erase() @ 0x8a35d6 leveldb::DBImpl::CleanupCompaction() @ 0x8a7810 leveldb::DBImpl::BackgroundCompaction() @ 0x8a804d leveldb::DBImpl::BackgroundCall() @ 0x8c4eff leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper() @ 0x7f134b3c010d start_thread @ 0x7f134bf9f10d clone Test Plan: run db_bench with overwrite option Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5595	2012-09-21 10:55:38 -07:00
Dhruba Borthakur	fb4b381a0c	Print out the compile version in the LOG. Summary: Print out the compile version in the LOG. Test Plan: run dbbench and verify LOG Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5529	2012-09-18 13:24:32 -07:00
heyongqiang	a8464ed820	add an option to disable seek compaction Summary: as subject. This diff should be good for benchmarking. will send another diff to make it better in the case the seek compaction is enable. In that coming diff, will not count a seek if the bloomfilter filters. Test Plan: build Reviewers: dhruba, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5481	2012-09-17 13:59:57 -07:00
Dhruba Borthakur	ba55d77b5d	Ability to take a file-lvel snapshot from leveldb. Summary: A set of apis that allows an application to backup data from the leveldb database based on a set of files. Test Plan: unint test attached. more coming soon. Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5439	2012-09-17 09:14:50 -07:00
heyongqiang	b85cdca690	add a global var leveldb::useMmapRead to enable mmap Summary: Summary: as subject. this can be used for benchmarking. If we want it for some cases, we can do more changes to make this part of the option. Test Plan: db_test Reviewers: dhruba CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5451	2012-09-16 22:07:35 -07:00
heyongqiang	dcbd6be340	remove boost Summary: as subject Test Plan: build Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D5469	2012-09-16 19:33:43 -07:00
Mark Callaghan	fa29f82548	scan a long for FLAGS_cache_size to fix a compiler warning Summary: FLAGS_cache_size is a long, no need to scan %lld into a size_t for it (which generates a compiler warning) Test Plan: run db_bench Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: heyongqiang Differential Revision: https://reviews.facebook.net/D5427	2012-09-14 12:45:42 -07:00
Mark Callaghan	837113908c	Add --compression_type=X option with valid values: snappy (default) none bzip2 zlib Summary: This adds an option to db_bench to specify the compression algorithm to use for LevelDB Test Plan: ran db_bench Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D5421	2012-09-14 12:28:21 -07:00
Dhruba Borthakur	93f4952089	Ability to switch off filesystem read-aheads Summary: Ability to switch off filesystem read-aheads. This change is backward-compatible: the default setting is to allow file system read-aheads. Test Plan: run benchmarks Reviewers: heyongqiang, adsharma Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5391	2012-09-13 12:09:56 -07:00
Dhruba Borthakur	7ecc5d4ad5	Enable db_bench to specify block size. Summary: Enable db_bench to specify block size. Test Plan: compile and run Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5373	2012-09-13 10:22:43 -07:00
Dhruba Borthakur	407727b75f	Fix compiler warnings. Use uint64_t instead of uint. Summary: Fix compiler warnings. Use uint64_t instead of uint. Test Plan: build using -Wall Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5355	2012-09-12 14:42:36 -07:00
heyongqiang	0f43aa474e	put log in a seperate dir Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path. Test Plan: db_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D5205	2012-09-06 17:52:08 -07:00
Dhruba Borthakur	536ca698ba	The ReadnRandomWriteRandom was always looping FLAGS_num of times. Summary: If none of reads or writes are specified by user, then pick the FLAGS_NUM as the number of iterations in the ReadRandomWriteRandom test. If either reads or writes are defined, then use their maximum. Test Plan: run benchmark Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5217	2012-09-06 09:13:24 -07:00
Dhruba Borthakur	94208a7881	Benchmark with both reads and writes at the same time. Summary: This patch enables the db_bench benchmark to issue both random reads and random writes at the same time. This options can be trigged via ./db_bench --benchmarks=readrandomwriterandom The default percetage of reads is 90. One can change the percentage of reads by specifying the --readwritepercent. ./db_bench --benchmarks=readrandomwriterandom=50 This is a feature request from Jeffro asking for leveldb performance with a 90:10 read:write ratio. Test Plan: run on test machine. Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5067	2012-09-04 12:06:26 -07:00
Dhruba Borthakur	fe93631678	Clean up compiler warnings generated by -Wall option. Summary: Clean up compiler warnings generated by -Wall option. make clean all OPT=-Wall This is a pre-requisite before making a new release. Test Plan: compile and run unit tests Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5019	2012-08-29 14:24:51 -07:00
Dhruba Borthakur	e5fe80e4e3	The sharding of the block cache is limited to 220 pieces. Summary: The numbers of shards that the block cache is divided into is configurable. However, if the user specifies that he/she wants the block cache to be divided into more than 220 pieces, then the system will rey to allocate a huge array of that size) that could fail. It is better to limit the sharding of the block cache to an upper bound. The default sharding is 16 shards (i.e. 24) and the maximum is now 2 million shards (i.e. 2*20). Also, fixed a bug with the LRUCache where the numShardBits should be a private member of the LRUCache object rather than a static variable. Test Plan: run db_bench with --cache_numshardbits=64. Task ID: # Blame Rev: Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5013	2012-08-29 12:17:59 -07:00
heyongqiang	a4f9b8b49e	merge 1.5 Summary: as subject Test Plan: db_test table_test Reviewers: dhruba	2012-08-28 11:43:33 -07:00
heyongqiang	6fee5a74f5	Do not spin in a tight loop attempting compactions if there is a compaction error Summary: as subject. ported the change from google code leveldb 1.5 Test Plan: run db_test Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4839	2012-08-28 11:43:33 -07:00
heyongqiang	935fdd030b	fix filename_test Summary: as subject Test Plan: run filename_test Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4965	2012-08-28 11:42:42 -07:00
heyongqiang	690bf88682	in db_stats_logger.cc, hold mutex_ while accessing versions_ Summary: as subject Test Plan:db_test Reviewers: dhruba	2012-08-28 11:29:30 -07:00
heyongqiang	d3759ca121	fix db_test error with scribe logger turned on Summary: as subject Test Plan: db_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4929	2012-08-28 11:22:58 -07:00
Dhruba Borthakur	fc20273e73	Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync). Summary: Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync). This is needed for data durability when running on ext3 filesystems. Added options to the benchmark db_bench to generate performance numbers with either fsync or fdatasync enabled. Cleaned up Makefile to build leveldb_shell only when building the thrift leveldb server. Test Plan: build and run benchmark Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D4911	2012-08-27 21:24:17 -07:00
heyongqiang	1de83cc2ac	add more logs Summary: as subject add a tool to read sst file as subject. ./sst_reader --command=check --file= ./sst_reader --command=scan --file= Test Plan: db_test run this command Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4881	2012-08-24 15:20:49 -07:00
heyongqiang	1c99b0a6b3	add more logs Summary: as subject Test Plan:db_test Reviewers: dhruba	2012-08-24 15:18:43 -07:00
Dhruba Borthakur	f3ee54526f	Utility to dump manifest contents. Summary: ./manifest_dump --file=/tmp/dbbench/MANIFEST-000002 Output looks like manifest_file_number 30 next_file_number 31 last_sequence 388082 log_number 28 prev_log_number 0 --- level 0 --- --- level 1 --- --- level 2 --- 5:3244155['0000000000000000' @ 1 : 1 .. '0000000000028220' @ 28221 : 1] 7:3244177['0000000000028221' @ 28222 : 1 .. '0000000000056441' @ 56442 : 1] 9:3244156['0000000000056442' @ 56443 : 1 .. '0000000000084662' @ 84663 : 1] 11:3244178['0000000000084663' @ 84664 : 1 .. '0000000000112883' @ 112884 : 1] 13:3244158['0000000000112884' @ 112885 : 1 .. '0000000000141104' @ 141105 : 1] 15:3244176['0000000000141105' @ 141106 : 1 .. '0000000000169325' @ 169326 : 1] 17:3244156['0000000000169326' @ 169327 : 1 .. '0000000000197546' @ 197547 : 1] 19:3244178['0000000000197547' @ 197548 : 1 .. '0000000000225767' @ 225768 : 1] 21:3244155['0000000000225768' @ 225769 : 1 .. '0000000000253988' @ 253989 : 1] 23:3244179['0000000000253989' @ 253990 : 1 .. '0000000000282209' @ 282210 : 1] 25:3244157['0000000000282210' @ 282211 : 1 .. '0000000000310430' @ 310431 : 1] 27:3244176['0000000000310431' @ 310432 : 1 .. '0000000000338651' @ 338652 : 1] 29:3244156['0000000000338652' @ 338653 : 1 .. '0000000000366872' @ 366873 : 1] --- level 3 --- --- level 4 --- --- level 5 --- --- level 6 --- Test Plan: run on test directory created by dbbench Reviewers: heyongqiang Reviewed By: heyongqiang CC: hustliubo Differential Revision: https://reviews.facebook.net/D4743	2012-08-24 15:17:09 -07:00
Dhruba Borthakur	e5a7c8e580	Log the open-options to the LOG. Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit. Test Plan: num db_bench Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D4833	2012-08-22 12:22:12 -07:00
heyongqiang	21082fa13c	regression for trigger compaction logic Summary: as subject Test Plan: manually run db_bench confirmed Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4809	2012-08-21 18:11:21 -07:00
Dhruba Borthakur	a098207c95	Fixed unit test c_test by initializing logger=NULL. Summary: Fixed unit test c_test by initializing logger=NULL. Removed "atomic" from last_log_ts so that unit tests do not require C11 compiler. Anyway, last_log_ts is mostly used for logging, so it is ok if it is loosely accurate. Test Plan: run c_test Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D4803	2012-08-21 17:10:29 -07:00
Dhruba Borthakur	f4e7febf22	Record the version of the source repository that was used to build the leveldb library. Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename \| grep _build_" to find the version of the source that we used to build the executable file. Test Plan: none Differential Revision: https://reviews.facebook.net/D4785	2012-08-21 14:47:15 -07:00
heyongqiang	6ba1f17789	adding a scribe logger in leveldb to log leveldb deploy stats Summary: as subject. A new log is written to scribe via thrift client when a new db is opened and when there is a compaction. a new option var scribe_log_db_stats is added. Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats" Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4659	2012-08-21 11:43:22 -07:00
Dhruba Borthakur	e56b2c5a31	Prevent concurrent multiple opens of leveldb database. Summary: The fcntl call cannot detect lock conflicts when invoked multiple times from the same thread. Use a static lockedFile Set to record the paths that are locked. A lockfile request checks to see if htis filename already exists in lockedFiles, if so, then it triggers an error. Otherwise, it inserts the filename in the lockedFiles Set. A unlock file request verifies that the filename is in the lockedFiles set and removes it from lockedFiles set. Test Plan: unit test attached Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D4755	2012-08-20 23:55:04 -07:00
heyongqiang	deb1a1fa9b	add disable wal to db_bench Summary: as subject. ./db_bench --benchmarks=fillrandom --num=1000000 --disable_data_sync=1 --write_buffer_size=50000000 --target_file_size_base=100000000 --disable_wal=1 LevelDB: version 1.4 Date: Sun Aug 19 16:01:59 2012 CPU: 8 * Intel(R) Xeon(R) CPU L5630 @ 2.13GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) ------------------------------------------------ fillrandom : 4.591 micros/op 217797 ops/sec; 24.1 MB/s ./db_bench --benchmarks=fillrandom --num=1000000 --disable_data_sync=1 --write_buffer_size=50000000 --target_file_size_base=100000000 LevelDB: version 1.4 Date: Sun Aug 19 16:02:54 2012 CPU: 8 * Intel(R) Xeon(R) CPU L5630 @ 2.13GHz CPUCache: 12288 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) ------------------------------------------------ fillrandom : 3.696 micros/op 270530 ops/sec; 29.9 MB/s Test Plan: db_bench Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4767	2012-08-19 22:37:51 -07:00
Dhruba Borthakur	2aa514ec8c	Utility to dump manifest contents. Summary: ./manifest_dump --file=/tmp/dbbench/MANIFEST-000002 Output looks like manifest_file_number 30 next_file_number 31 last_sequence 388082 log_number 28 prev_log_number 0 --- level 0 --- --- level 1 --- --- level 2 --- 5:3244155['0000000000000000' @ 1 : 1 .. '0000000000028220' @ 28221 : 1] 7:3244177['0000000000028221' @ 28222 : 1 .. '0000000000056441' @ 56442 : 1] 9:3244156['0000000000056442' @ 56443 : 1 .. '0000000000084662' @ 84663 : 1] 11:3244178['0000000000084663' @ 84664 : 1 .. '0000000000112883' @ 112884 : 1] 13:3244158['0000000000112884' @ 112885 : 1 .. '0000000000141104' @ 141105 : 1] 15:3244176['0000000000141105' @ 141106 : 1 .. '0000000000169325' @ 169326 : 1] 17:3244156['0000000000169326' @ 169327 : 1 .. '0000000000197546' @ 197547 : 1] 19:3244178['0000000000197547' @ 197548 : 1 .. '0000000000225767' @ 225768 : 1] 21:3244155['0000000000225768' @ 225769 : 1 .. '0000000000253988' @ 253989 : 1] 23:3244179['0000000000253989' @ 253990 : 1 .. '0000000000282209' @ 282210 : 1] 25:3244157['0000000000282210' @ 282211 : 1 .. '0000000000310430' @ 310431 : 1] 27:3244176['0000000000310431' @ 310432 : 1 .. '0000000000338651' @ 338652 : 1] 29:3244156['0000000000338652' @ 338653 : 1 .. '0000000000366872' @ 366873 : 1] --- level 3 --- --- level 4 --- --- level 5 --- --- level 6 --- Test Plan: run on test directory created by dbbench Reviewers: heyongqiang Reviewed By: heyongqiang CC: hustliubo Differential Revision: https://reviews.facebook.net/D4743	2012-08-17 22:36:59 -07:00
heyongqiang	680e571c4c	add compaction log Summary: Summary: add compaction summary to log log looks like: 2012/08/17-18:18:32.557334 7fdcaa2bb700 Compaction summary: Base level 0, input file:[11 9 7 ],[] Test Plan: tested via db_test Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4749	2012-08-17 19:29:39 -07:00
heyongqiang	20ee76bd34	use ts as suffix for LOG.old files Summary: as subject and only maintain 10 log files. Test Plan: new test in db_test Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4731	2012-08-17 16:22:04 -07:00
heyongqiang	f16e393658	add more options to db_ben Summary: as subject Test Plan: run db_bench with new options Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4677	2012-08-15 17:42:33 -07:00
heyongqiang	fcb2ea4715	disable data sync options needs to be checked when doing level-0 dump Summary: Summary: as subject Test Plan: use db_bench Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4671	2012-08-15 16:39:02 -07:00
Dhruba Borthakur	c3096afd61	Introduce a new option disableDataSync for opening the database. If this is set to true, then the data written to newly created data files are not sycned to disk, instead depend on the OS to flush dirty data to stable storage. This option is good for bulk Test Plan: manual tests Task ID: # Blame Rev: Differential Revision: https://reviews.facebook.net/D4515	2012-08-03 15:23:53 -07:00
heyongqiang	22ee777f68	add flush interface to DB Summary: as subject. The flush will flush everything in the db. Test Plan: new test in db_test.cc Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4029	2012-07-06 12:11:19 -07:00
heyongqiang	a347d4ac0d	add disable WAL option Summary: add disable WAL option Test Plan: new testcase in db_test.cc Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4011	2012-07-05 15:06:56 -07:00
heyongqiang	4e4b6812ff	Make some variables configurable for each db instance Summary: Make configurable 'targetFileSize', 'targetFileSizeMultiplier', 'maxBytesForLevelBase', 'maxBytesForLevelMultiplier', 'expandedCompactionFactor', 'maxGrandParentOverlapFactor' Test Plan: N/A Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3801	2012-06-27 14:36:31 -07:00
Dhruba Borthakur	a35e574344	Make Leveldb save data into HDFS files. You have to set USE_HDFS in your environment variable to compile leveldb with HDFS support. Test Plan: Run benchmark. Differential Revision: https://reviews.facebook.net/D3549	2012-06-14 00:29:01 -07:00
Dhruba Borthakur	338939e5c1	Print log message when we are throttling writes. Summary: Added option --writes=xxx to specify the number of keys that we want to overwrite in the benchmark. Task ID: # Blame Rev: Test Plan: Revert Plan: Reviewers: adsharma CC: sc Differential Revision: https://reviews.facebook.net/D3465	2012-06-01 14:03:37 -07:00
Dhruba Borthakur	f50ece60c7	Fix table-cache size bug, gather table-cache statistics and prevent readahead done by fs. Summary: Summary: The db_bench test was not using the specified value for the max-file-open. Fixed. The fs readhead is switched off. Gather statistics about the table cache and print it out at the end of the tets run. Test Plan: Revert Plan: Reviewers: adsharma, sc Reviewed By: adsharma Differential Revision: https://reviews.facebook.net/D3441	2012-05-30 16:42:45 -07:00
Dhruba Borthakur	8f293b68a9	Support --bufferedio=[0,1] from db_bench. If bufferedio = 0, then the read code path clears the OS page cache after the IO is completed. The default remains as bufferedio=1 Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3429	2012-05-29 13:29:44 -07:00
Dhruba Borthakur	33a3c6ff6c	Ability to make the benchmark issue a large number of IOs. This is helpful to populate many gigabytes of data for benchmarking at scale. Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3333	2012-05-22 12:20:09 -07:00
Dhruba Borthakur	3b86a51cb1	Ability to switch on checksum verification from benchmark. Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3309	2012-05-19 00:13:50 -07:00
Dhruba Borthakur	a2a0e358cb	Add support to specify the number of shards for the Block cache. By default, the block cache is sharded into 16 parts. Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3273	2012-05-16 17:23:49 -07:00
Dhruba Borthakur	37d0dcb9b1	Use the elapsed time (instead of the per-thread time) to compute ops/sec. Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3147	2012-05-11 12:43:31 -07:00
Arun Sharma	90b2924fb2	skiplist: optimize for sequential insert pattern Summary: skiplist doesn't cache the location of the last insert and becomes CPU bound when the input data has sequential keys. Notes on thread safety: ::Insert() already requires external synchronization. So this change is not making it any worse. Test Plan: skiplist_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3129	2012-05-11 09:57:40 -07:00
Dhruba Borthakur	cc6c32535a	Support arcdiff. Summary: Task ID: # Blame Rev: Test Plan: Revert Plan: Differential Revision: https://reviews.facebook.net/D3105	2012-05-09 23:35:05 -07:00
Sanjay Ghemawat	85584d497e	Added bloom filter support. In particular, we add a new FilterPolicy class. An instance of this class can be supplied in Options when opening a database. If supplied, the instance is used to generate summaries of keys (e.g., a bloom filter) which are placed in sstables. These summaries are consulted by DB::Get() so we can avoid reading sstable blocks that are guaranteed to not contain the key we are looking for. This change provides one implementation of FilterPolicy based on bloom filters. Other changes: - Updated version number to 1.4. - Some build tweaks. - C binding for CompactRange. - A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom. - Minor .gitignore update.	2012-04-17 08:36:46 -07:00
Sanjay Ghemawat	bc1ee4d25e	build shared libraries; updated version to 1.3; add Status accessors	2012-03-30 13:15:49 -07:00
Sanjay Ghemawat	583f1499c0	fix LOCK file deletion to prevent crash on windows	2012-03-09 07:51:04 -08:00
Sanjay Ghemawat	d79762e273	added group commit; drastically speeds up mult-threaded synchronous write workloads	2012-03-08 16:23:21 -08:00
Sanjay Ghemawat	239ac9d2de	avoid very large compactions; fix build on Linux	2012-02-02 09:34:14 -08:00
Sanjay Ghemawat	3c8be108bf	fixed issues 66 (leaking files on disk error) and 68 (no sync of CURRENT file)	2012-01-25 14:56:52 -08:00
Hans Wennborg	42fb47f6ed	Pass system's CFLAGS, remove exit time destructor, sstable bug fix. - Pass system's values of CFLAGS,LDFLAGS. Don't override OPT if it's already set. Original patch by Alessio Treglia <alessio@debian.org>: http://code.google.com/p/leveldb/issues/detail?id=27#c6 - Remove 1 exit time destructor from leveldb. See http://crbug.com/101600 - Fix problem where sstable building code would pass an internal key to the user comparator. (Sync with uptream at 25436817.)	2011-11-14 17:06:16 +00:00
Hans Wennborg	36a5f8ed7f	A number of fixes: - Replace raw slice comparison with a call to user comparator. Added test for custom comparators. - Fix end of namespace comments. - Fixed bug in picking inputs for a level-0 compaction. When finding overlapping files, the covered range may expand as files are added to the input set. We now correctly expand the range when this happens instead of continuing to use the old range. For example, suppose L0 contains files with the following ranges: F1: a .. d F2: c .. g F3: f .. j and the initial compaction target is F3. We used to search for range f..j which yielded {F2,F3}. However we now expand the range as soon as another file is added. In this case, when F2 is added, we expand the range to c..j and restart the search. That picks up file F1 as well. This change fixes a bug related to deleted keys showing up incorrectly after a compaction as described in Issue 44. (Sync with upstream @25072954)	2011-10-31 17:22:06 +00:00
Gabor Cselle	299ccedfec	A number of bugfixes: - Added DB::CompactRange() method. Changed manual compaction code so it breaks up compactions of big ranges into smaller compactions. Changed the code that pushes the output of memtable compactions to higher levels to obey the grandparent constraint: i.e., we must never have a single file in level L that overlaps too much data in level L+1 (to avoid very expensive L-1 compactions). Added code to pretty-print internal keys. - Fixed bug where we would not detect overlap with files in level-0 because we were incorrectly using binary search on an array of files with overlapping ranges. Added "leveldb.sstables" property that can be used to dump all of the sstables and ranges that make up the db state. - Removing post_write_snapshot support. Email to leveldb mailing list brought up no users, just confusion from one person about what it meant. - Fixing static_cast char to unsigned on BIG_ENDIAN platforms. Fixes Issue 35 and Issue 36. - Comment clarification to address leveldb Issue 37. - Change license in posix_logger.h to match other files. - A build problem where uint32 was used instead of uint32_t. Sync with upstream @24408625	2011-10-05 16:30:28 -07:00
gabor@google.com	7263023651	Bugfixes: for Get(), don't hold mutex while writing log. - Fix bug in Get: when it triggers a compaction, it could sometimes mark the compaction with the wrong level (if there was a gap in the set of levels examined for the Get). - Do not hold mutex while writing to the log file or to the MANIFEST file. Added a new benchmark that runs a writer thread concurrently with reader threads. Percentiles ------------------------------ micros/op: avg median 99 99.9 99.99 99.999 max ------------------------------------------------------ before: 42 38 110 225 32000 42000 48000 after: 24 20 55 65 130 1100 7000 - Fixed race in optimized Get. It should have been using the pinned memtables, not the current memtables. git-svn-id: https://leveldb.googlecode.com/svn/trunk@50 62dab493-f737-651d-591e-8d6aee1b9529	2011-09-01 19:08:02 +00:00
gabor@google.com	e3584f9c28	Bugfix for issue 33; reduce lock contention in Get(), parallel benchmarks. - Fix for issue 33 (non-null-terminated result from leveldb_property_value()) - Support for running multiple instances of a benchmark in parallel. - Reduce lock contention on Get(): (1) Do not hold the lock while searching memtables. (2) Shard block and table caches 16-ways. Benchmark for evaluating this change: $ db_bench --benchmarks=fillseq1,readrandom --threads=$n (fillseq1 is a small hack to make sure fillseq runs once regardless of number of threads specified on the command line). git-svn-id: https://leveldb.googlecode.com/svn/trunk@49 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-22 21:08:51 +00:00
gabor@google.com	ab323f7e1e	Bugfixes for iterator and documentation. - Fix bug in Iterator::Prev where it would return the wrong key. Fixes issues 29 and 30. - Added a tweak to testharness to allow running just some tests. - Fixing two minor documentation errors based on issues 28 and 25. - Cleanup; fix namespaces of export-to-C code. Also fix one "const char" vs "char" mismatch. git-svn-id: https://leveldb.googlecode.com/svn/trunk@48 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-16 01:21:01 +00:00
dgrogan@chromium.org	a05525d13b	@23023120 git-svn-id: https://leveldb.googlecode.com/svn/trunk@47 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-06 00:19:37 +00:00
gabor@google.com	021ee9c32b	C binding for leveldb, better readseq benchmark for SQLite. - Added a C binding for LevelDB. May be useful as a stable ABI that can be used by programs that keep leveldb in a shared library, or for JNI API. - Replaced SQLite's readseq benchmark to a more efficient version. SQLite readseq speeds increased by about a factor of 2x from the previous version. Also updated benchmark page to reflect readseq speed up. git-svn-id: https://leveldb.googlecode.com/svn/trunk@46 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-05 20:40:49 +00:00
gabor@google.com	60bd8015f2	Speed up Snappy uncompression, new Logger interface. - Removed one copy of an uncompressed block contents changing the signature of Snappy_Uncompress() so it uncompresses into a flat array instead of a std::string. Speeds up readrandom ~10%. - Instead of a combination of Env/WritableFile, we now have a Logger interface that can be easily overridden applications that want to supply their own logging. - Separated out the gcc and Sun Studio parts of atomic_pointer.h so we can use 'asm', 'volatile' keywords for Sun Studio. git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-21 02:40:18 +00:00
gabor@google.com	6872ace901	Sun Studio support, and fix for test related memory fixes. - LevelDB patch for Sun Studio Based on a patch submitted by Theo Schlossnagle - thanks! This fixes Issue 17. - Fix a couple of test related memory leaks. git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-19 23:36:47 +00:00
gabor@google.com	6699c7ebe6	Small tweaks and bugfixes for Issue 18 and 19. Slight tweak to the no-overlap optimization: only push to level 2 to reduce the amount of wasted space when the same small key range is being repeatedly overwritten. Fix for Issue 18: Avoid failure on Windows by avoiding deletion of lock file until the end of DestroyDB(). Fix for Issue 19: Disregard sequence numbers when checking for overlap in sstable ranges. This fixes issue 19: when writing the same key over and over again, we would generate a sequence of sstables that were never merged together since their sequence numbers were disjoint. Don't ignore map/unmap error checks. Miscellaneous fixes for small problems Sanjay found while diagnosing issue/9 and issue/16 (corruption_testr failures). - log::Reader reports the record type when it finds an unexpected type. - log::Reader no longer reports an error when it encounters an expected zero record regardless of the setting of the "checksum" flag. - Added a missing forward declaration. - Documented a side-effects of larger write buffer sizes (longer recovery time). git-svn-id: https://leveldb.googlecode.com/svn/trunk@37 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-15 00:20:57 +00:00
gabor@google.com	e0cbd242cb	Fixing issue 11: version_set_test.cc was missing git-svn-id: https://leveldb.googlecode.com/svn/trunk@33 62dab493-f737-651d-591e-8d6aee1b9529	2011-06-22 18:45:39 +00:00
gabor@google.com	ccf0fcd5c2	A number of smaller fixes and performance improvements: - Implemented Get() directly instead of building on top of a full merging iterator stack. This speeds up the "readrandom" benchmark by up to 15-30%. - Fixed an opensource compilation problem. Added --db=<name> flag to control where the database is placed. - Automatically compact a file when we have done enough overlapping seeks to that file. - Fixed a performance bug where we would read from at least one file in a level even if none of the files overlapped the key being read. - Makefile fix for Mac OSX installations that have XCode 4 without XCode 3. - Unified the two occurrences of binary search in a file-list into one routine. - Found and fixed a bug where we would unnecessarily search the last file when looking for a key larger than all data in the level. - A fix to avoid the need for trivial move compactions and therefore gets rid of two out of five syncs in "fillseq". - Removed the MANIFEST file write when switching to a new memtable/log-file for a 10-20% improvement on fill speed on ext4. - Adding a SNAPPY setting in the Makefile for folks who have Snappy installed. Snappy compresses values and speeds up writes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@32 62dab493-f737-651d-591e-8d6aee1b9529	2011-06-22 02:36:45 +00:00
hans@chromium.org	80e5b0d944	sync with upstream @21706995 Fixed race condition reported by Dave Smit (dizzyd@dizzyd,com) on the leveldb mailing list. We were not signalling waiters after a trivial move from level-0. The result was that in some cases (hard to reproduce), a write would get stuck forever waiting for the number of level-0 files to drop below its hard limit. The new code is simpler: there is just one condition variable instead of two, and the condition variable is signalled after every piece of background work finishes. Also, all compaction work (including for manual compactions) is done in the background thread, and therefore we can remove the "compacting_" variable. git-svn-id: https://leveldb.googlecode.com/svn/trunk@31 62dab493-f737-651d-591e-8d6aee1b9529	2011-06-07 14:40:26 +00:00
dgrogan@chromium.org	740d8b3d00	Update from upstream @21551990 * Patch LevelDB to build for OSX and iOS * Fix race condition in memtable iterator deletion. * Other small fixes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@29 62dab493-f737-651d-591e-8d6aee1b9529	2011-05-28 00:53:58 +00:00
dgrogan@chromium.org	da79909507	sync with upstream @ 21409451 Check the NEWS file for details of what changed. git-svn-id: https://leveldb.googlecode.com/svn/trunk@28 62dab493-f737-651d-591e-8d6aee1b9529	2011-05-21 02:17:43 +00:00
dgrogan@chromium.org	ba6dac0e80	@20776309 * env_chromium.cc should not export symbols. * Fix MSVC warnings. * Removed large value support. * Fix broken reference to documentation file git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-20 22:48:11 +00:00
dgrogan@chromium.org	69c6d38342	reverting disastrous MOE commit, returning to r21 git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:11:15 +00:00
dgrogan@chromium.org	b743906eea	Revision created by MOE tool push_codebase. MOE_MIGRATION= git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:01:25 +00:00
dgrogan@chromium.org	b409afe968	chmod a-x git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-18 23:15:58 +00:00
dgrogan@chromium.org	f779e7a5d8	@20602303. Default file permission is now 755. git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-12 19:38:58 +00:00
jorlow@chromium.org	4671a695fc	Move include files into a leveldb subdir. git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-30 18:35:40 +00:00
jorlow@chromium.org	e2da744e12	Upstream changes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@16 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-28 20:43:44 +00:00
jorlow@chromium.org	e11bdf1935	Upstream changes git-svn-id: https://leveldb.googlecode.com/svn/trunk@15 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-25 20:27:43 +00:00
jorlow@chromium.org	8303bb1b33	Pull from upstream. git-svn-id: https://leveldb.googlecode.com/svn/trunk@14 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-22 23:24:02 +00:00
jorlow@chromium.org	13b72af77b	More changes from upstream. git-svn-id: https://leveldb.googlecode.com/svn/trunk@12 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-22 18:32:49 +00:00
jorlow@chromium.org	0e38925490	Sync in bug fixes git-svn-id: https://leveldb.googlecode.com/svn/trunk@9 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-21 19:40:57 +00:00
jorlow@chromium.org	f67e15e50f	Initial checkin. git-svn-id: https://leveldb.googlecode.com/svn/trunk@2 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-18 22:37:00 +00:00

... 14 15 16 17 18 ...

1420 Commits