rocksdb

Author	SHA1	Message	Date
Haobo Xu	0e422308aa	[RocksDB] Remove Log file immediately after memtable flush Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11709	2013-09-12 11:54:44 -07:00
Dhruba Borthakur	32c965d417	Flush was hanging because the configured options specified that more than 1 memtable need to be merged. Summary: There is an config option called Options.min_write_buffer_number_to_merge that specifies the minimum number of write buffers to merge in memory before flushing to a file in L0. But in the the case when the db is being closed, we should not be using this config, instead we should flush whatever write buffers were available at that time. Test Plan: Unit test attached. Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12717	2013-09-06 16:28:33 -07:00
Mayank Agarwal	aa5c897d42	Return pathname relative to db dir in LogFile and cleanup AppendSortedWalsOfType Summary: So that replication can just download from wherever LogFile.Pathname is pointing them. Test Plan: make all check;./db_repl_stress Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12609	2013-09-04 13:44:43 -07:00
Xing Jin	42c109cc2e	New ldb command to convert compaction style Summary: Add new command "change_compaction_style" to ldb tool. For universal->level, it shows "nothing to do". For level->universal, it compacts all files into a single one and moves the file to level 0. Also add check for number of files at level 1+ when opening db with universal compaction style. Test Plan: 'make all check'. New unit test for internal convertion function. Also manully test various cmd like: ./ldb change_compaction_style --old_compaction_style=0 --new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test Reviewers: haobo, dhruba Reviewed By: haobo CC: vamsi, emayanke Differential Revision: https://reviews.facebook.net/D12603	2013-09-04 13:13:08 -07:00
Mayank Agarwal	c34271a5a5	Fix bug in Counters and record Sequencenumber using only TickerCount Summary: The way counters/statistics are implemented in rocksdb demands that enum Tickers and TickerNameMap follow the same order, otherwise statistics exposed from fbcode/rocks get out-of-sync. 2 counters for prefix had violated this order and when I built counters for fbcode/mcrocksdb, statistics for sequence number were appearing out-of-sync. The other change is to record sequence-number using setTickerCount only and not recordTick. This is because of difference in statistics as understood by rocks/utils which uses ServiceData::statistics function and rocksdb statistics. In rocksdb there is just 1 counter for a countername. But in ServiceData there are 4 independent buckets for every countername-Count, Sum, Average and Rate. SetTickerCount and RecordTick update the same variable in rocksdb but different buckets in ServiceData. Therefore, I had to choose one consistent function from RecordTick or SetTickerCount for sequence number in rocksdb. I chose SetTickerCount because the statistics object in options passed during rocksdb-open is user-dependent and SetTickerCount makes sense there. There will be a corresponding diff to mcorcksdb in fbcode shortly. Test Plan: make all check; check ticker value using fprintfs Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12669	2013-09-01 17:59:32 -07:00
Mayank Agarwal	ab5c5c28fe	Fix build caused by DeleteFile not tolerating / at the beginning Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName. Test Plan: make;./delete_filetest;make all check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12663	2013-09-01 17:59:13 -07:00
Dhruba Borthakur	59de2dbad7	Cleanup DeleteFile API Summary: The DeleteFile API was removing files inside the db-lock. This is now changed to remove files outside the db-lock. The GetLiveFilesMetadata() returns the smallest and largest seqnuence number of each file as well. Test Plan: deletefile_test Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12567	2013-08-28 21:18:58 -07:00
Haobo Xu	48e5ea0c34	[RocksDB] Fix TransformRepFactory related valgrind problem Summary: Let TransformRepFactory own the passed in transform. Also make it better encapsulated. Test Plan: make valgrind_check; Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12591	2013-08-28 19:27:54 -07:00
Dhruba Borthakur	fc0c399d2e	Introduced a new flag non_blocking_io in ReadOptions. Summary: If ReadOptions.non_blocking_io is set to true, then KeyMayExists and Iterators will return data that is cached in RAM. If the Iterator needs to do IO from storage to serve the data, then the Iterator.status() will return Status::IsRetry(). Test Plan: Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs issues from storage. Added DBTest.NonBlockingIteration to verify nonblocking Iterations. Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12531	2013-08-28 10:49:14 -07:00
Haobo Xu	43eef52001	[RocksDB] move stats counting outside of mutex protected region for DB::Get() Summary: As title. This is possible as tickers are atomic now. db_bench on high qps in-memory muti-thread random get workload, showed ~5% throughput improvement. Test Plan: make check; db_bench; db_stress Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12555	2013-08-27 13:36:10 -07:00
Deon Nicholas	573844807c	Fix for no_io Summary: Oops. My bad. Test Plan: Make all check Reviewers: emayanke Reviewed By: emayanke CC: haobo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D12525	2013-08-23 16:36:01 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Jim Paton	74781a0c49	Add three new MemTableRep's Summary: This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep. UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that. VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector. PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets. I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed). Test Plan: make -j32 check ./db_stress --memtablerep=vector ./db_stress --memtablerep=unsorted ./db_stress --memtablerep=prefixhash --prefix_size=10 Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12117	2013-08-22 23:10:02 -07:00
Xing Jin	17dc128048	Pull from https://reviews.facebook.net/D10917 Summary: Pull Mark's patch and slightly revise it. I revised another place in db_impl.cc with similar new formula. Test Plan: make all check. Also run "time ./db_bench --num=2500000000 --numdistinct=2200000000". It has run for 20+ hours and hasn't finished. Looks good so far: Installed stack trace handler for SIGILL SIGSEGV SIGBUS SIGABRT LevelDB: version 2.0 Date: Tue Aug 20 23:11:55 2013 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 2500000000 RawSize: 276565.6 MB (estimated) FileSize: 157356.3 MB (estimated) Write rate limit: 0 Compression: snappy WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/leveldbtest-3088/dbbench] fillseq : 7202.000 micros/op 138 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] fillsync : 7148.000 micros/op 139 ops/sec; (2500000 ops) DB path: [/tmp/leveldbtest-3088/dbbench] fillrandom : 7105.000 micros/op 140 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] overwrite : 6930.000 micros/op 144 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.020 micros/op 980507 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.021 micros/op 979620 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readseq : 113.000 micros/op 8849 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readreverse : 102.000 micros/op 9803 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] Created bg thread 0x7f0ac17f7700 compact : 111701.000 micros/op 8 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readrandom : 1.020 micros/op 980376 ops/sec; (0 of 2500000000 found) DB path: [/tmp/leveldbtest-3088/dbbench] readseq : 120.000 micros/op 8333 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] readreverse : 29.000 micros/op 34482 ops/sec; DB path: [/tmp/leveldbtest-3088/dbbench] ... finished 618100000 ops Reviewers: MarkCallaghan, haobo, dhruba, chip Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D12441	2013-08-22 22:37:13 -07:00
Simha Venkataramaiah	60bf2b7d4a	Add APIs to query SST file metadata and to delete specific SST files Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures. Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api. I have used std::unique_ptr - will switch to boost:: since this is external. thoughts? Unit test is fragile right now as it expects the compaction at certain levels. Test Plan: unittest Reviewers: dhruba, vamsi, emayanke CC: zshao, leveldb, haobo Task ID: # Blame Rev:	2013-08-22 15:27:19 -07:00
Deon Nicholas	b87dcae1a3	Made merge_oprator a shared_ptr; and added TTL unit tests Test Plan: - make all check; - make release; - make stringappend_test; ./stringappend_test Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D12381	2013-08-20 13:35:28 -07:00
Mayank Agarwal	8a3547d38e	API for getting archived log files Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12087	2013-08-19 13:37:04 -07:00
Mayank Agarwal	387ac0f1e1	Expose statistic for sequence number and implement setTickerCount Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12327	2013-08-15 23:00:20 -07:00
Xing Jin	0a5afd1afc	Minor fix to current codes Summary: Minor fix to current codes, including: coding style, output format, comments. No major logic change. There are only 2 real changes, please see my inline comments. Test Plan: make all check Reviewers: haobo, dhruba, emayanke Differential Revision: https://reviews.facebook.net/D12297	2013-08-14 23:03:57 -07:00
Mayank Agarwal	f1bf169484	Counter for merge failure Summary: With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail. Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me. Test Plan: make all check Reviewers: dnicholas, haobo, dhruba, vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D12129	2013-08-13 14:25:42 -07:00
Tyler Harter	f5f1842282	Prefix filters for scans (v4) Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner. Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: haobo, emayanke Differential Revision: https://reviews.facebook.net/D12027	2013-08-13 14:04:56 -07:00
sumeet	3b81df34bd	Separate compaction filter for each compaction Summary: If we have same compaction filter for each compaction, application cannot know about the different compaction processes. Later on, we can put in more details in compaction filter for the application to consume and use it according to its needs. For e.g. In the universal compaction, we have a compaction process involving all the files while others don't involve all the files. Applications may want to collect some stats only when during full compaction. Test Plan: run existing unit tests Reviewers: haobo, dhruba Reviewed By: dhruba CC: xinyaohu, leveldb Differential Revision: https://reviews.facebook.net/D12057	2013-08-13 10:56:20 -07:00
Dhruba Borthakur	93d77a27d2	Universal Compaction should keep DeleteMarkers unless it is the earliest file. Summary: The pre-existing code was purging a DeleteMarker if thay key did not exist in deeper levels. But in the Universal Compaction Style, all files are in Level0. For compaction runs that did not include the earliest file, we were erroneously purging the DeleteMarkers. The fix is to purge DeleteMarkers only if the compaction includes the earlist file. Test Plan: DBTest.Randomized triggers this code path. Differential Revision: https://reviews.facebook.net/D12081	2013-08-09 14:03:57 -07:00
Xing Jin	8ae905ed63	Fix unit tests for universal compaction (step 2) Summary: Continue fixing existing unit tests for universal compaction. I have tried to apply universal compaction to all unit tests those haven't called ChangeOptions(). I left a few which are either apparently not applicable to universal compaction (because they check files/keys/values at level 1 or above levels), or apparently not related to compaction (e.g., open a file, open a db). I also add a new unit test for universal compaction. Good news is I didn't see any bugs during this round. Test Plan: Ran "make all check" yesterday. Has rebased and is rerunning Reviewers: haobo, dhruba Differential Revision: https://reviews.facebook.net/D12135	2013-08-09 13:35:44 -07:00
Xing Jin	17b8f786a3	Fix unit tests/bugs for universal compaction (first step) Summary: This is the first step to fix unit tests and bugs for universal compactiion. I added universal compaction option to ChangeOptions(), and fixed all unit tests calling ChangeOptions(). Some of these tests obviously assume more than 1 level and check file number/values in level 1 or above levels. I set kSkipUniversalCompaction for these tests. The major bug I found is manual compaction with universal compaction never stops. I have put a fix for it. I have also set universal compaction as the default compaction and found at least 20+ unit tests failing. I haven't looked into the details. The next step is to check all unit tests without calling ChangeOptions(). Test Plan: make all check Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D12051	2013-08-07 14:05:44 -07:00
Dhruba Borthakur	f5fa26b6a9	Merge branch 'performance' of github.com:facebook/rocksdb into performance Conflicts: db/builder.cc db/db_impl.cc db/version_set.cc include/leveldb/statistics.h	2013-08-07 11:58:06 -07:00
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	2013-08-05 20:14:32 -07:00
Jim Paton	8e792e5896	Add soft_rate_limit stats Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit. Test Plan: make -j32 check Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12021	2013-08-05 18:45:23 -07:00
Jim Paton	1036537c94	Add soft and hard rate limit support Summary: This diff adds support for both soft and hard rate limiting. The following changes are included: 1) Options.rate_limit is renamed to Options.hard_rate_limit. 2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds. 3) Options.soft_rate_limit is added. 4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit. 5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit. 6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default. Test Plan: make -j32 check ./db_stress Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12003	2013-08-05 15:43:49 -07:00
Dhruba Borthakur	711a30cb30	Merge branch 'master' into performance Conflicts: include/leveldb/options.h include/leveldb/statistics.h util/options.cc	2013-08-02 10:22:08 -07:00
Mayank Agarwal	59d0b02f8b	Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test Test Plan: make all check;db_stress for 1 hour Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11853	2013-08-01 09:07:46 -07:00
Jim Paton	9700677a2b	Slow down writes gradually rather than suddenly Summary: Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger). This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files. For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384. Test Plan: make -j32 check ./db_stress ./db_bench Reviewers: dhruba, haobo, MarkCallaghan, xjin Reviewed By: xjin CC: leveldb, xjin, zshao Differential Revision: https://reviews.facebook.net/D11859	2013-07-31 16:20:48 -07:00
Xing Jin	0f0a24e298	Make arena block size configurable Summary: Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size. I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer. Test Plan: new unit test, I am running db_test. For passing paramter from configured option to Arena, I tried tests like: TEST(DBTest, Arena_Option) { std::string dbname = test::TmpDir() + "/db_arena_option_test"; DestroyDB(dbname, Options()); DB* db = nullptr; Options opts; opts.create_if_missing = true; opts.arena_block_size = 1000000; // tested 99, 999999 Status s = DB::Open(opts, dbname, &db); db->Put(WriteOptions(), "a", "123"); } and printed some debug info. The results look good. Any suggestion for such a unit-test? Reviewers: haobo, dhruba, emayanke, jpaton Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D11799	2013-07-31 12:42:23 -07:00
Jim Paton	6db52b525a	Don't use redundant Env::NowMicros() calls Summary: After my patch for stall histograms, there are redundant calls to NowMicros() by both the stop watches and DBImpl::MakeRoomForWrites. So I removed the redundant calls such that the information is gotten from the stopwatch. Test Plan: make clean make -j32 check Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11883	2013-07-29 15:46:36 -07:00
Jim Paton	18afff2e63	Add stall counts to statistics Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur. Test Plan: make -j32 check ./db_stress # Not really sure what else should be done... Reviewers: dhruba, MarkCallaghan, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11841	2013-07-29 10:34:23 -07:00
Dhruba Borthakur	d7ba5bce37	Revert `6fbe4e981a`: If disable wal is set, then batch commits are avoided Summary: Revert "If disable wal is set, then batch commits are avoided" because keeping the mutex while inserting into the skiplist means that readers and writes are all serialized on the mutex. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-24 10:01:13 -07:00
Jim Paton	52d7ecfc78	Virtualize SkipList Interface Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations. Test Plan: make clean make -j32 check ./db_stress Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11739	2013-07-23 14:42:27 -07:00
Dhruba Borthakur	6fbe4e981a	If disable wal is set, then batch commits are avoided. Summary: rocksdb uses batch commit to write to transaction log. But if disable wal is set, then writes to transaction log are anyways avoided. In this case, there is not much value-add to batch things, batching can cause unnecessary delays to Puts(). This patch avoids batching when disableWal is set. Test Plan: make check. I am running db_stress now. Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11763	2013-07-23 14:22:57 -07:00
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	2013-07-23 13:36:50 -07:00
Haobo Xu	d364eea1fc	[RocksDB] Fix FindMinimumEmptyLevelFitting Summary: as title Test Plan: make check; Reviewers: xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D11751	2013-07-22 12:31:43 -07:00
Haobo Xu	9ee68871dc	[RocksDB] Enable manual compaction to move files back to an appropriate level. Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility. Test Plan: make check; Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D11553	2013-07-19 16:20:36 -07:00
Dhruba Borthakur	4a745a5666	Merge branch 'master' into performance Conflicts: db/version_set.cc include/leveldb/options.h util/options.cc	2013-07-17 15:05:57 -07:00
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	2013-07-11 12:11:11 -07:00
Haobo Xu	9ba82786ce	[RocksDB] Provide contiguous sequence number even in case of write failure Summary: Replication logic would be simplifeid if we can guarantee that write sequence number is always contiguous, even if write failure occurs. Dhruba and I looked at the sequence number generation part of the code. It seems fixable. Note that if WAL was successful and insert into memtable was not, we would be in an unfortunate state. The approach in this diff is : IO error is expected and error status will be returned to client, sequence number will not be advanced; In-mem error is not expected and we panic. Test Plan: make check; db_stress Reviewers: dhruba, sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D11439	2013-07-08 15:31:09 -07:00
Dhruba Borthakur	116ec527f2	Renamed 'hybrid_compaction' tp be "Universal Compaction'. Summary: All the universal compaction parameters are encapsulated in a new file universal_compaction.h Test Plan: make check	2013-07-03 15:47:53 -07:00
Dhruba Borthakur	47c4191fe8	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	2013-06-30 20:07:04 -07:00
Abhishek Kona	5ef6bb8c37	[rocksdb][refactor] statistic printing code to one place Summary: $title Test Plan: db_bench --statistics=1 Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11373	2013-06-18 20:28:41 -07:00
Dhruba Borthakur	6acbe0fc45	Compact multiple memtables before flushing to storage. Summary: Merge multiple multiple memtables in memory before writing it out to a file in L0. There is a new config parameter min_write_buffer_number_to_merge that specifies the number of write buffers that should be merged together to a single file in storage. The system will not flush wrte buffers to storage unless at least these many buffers have accumulated in memory. The default value of this new parameter is 1, which means that a write buffer will be immediately flushed to disk as soon it is ready. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11241	2013-06-18 14:28:04 -07:00
Haobo Xu	1afdf28701	[RocksDB] Compaction Filter Cleanup Summary: This hopefully gives the right semantics to compaction filter. Will write a small wiki to explain the ideas. Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11121	2013-06-14 14:23:08 -07:00
Deon Nicholas	4985a9f73b	[Rocksdb] [Multiget] Introduced multiget into db_bench Summary: Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n flags for db_bench. Also updated and tested the ReadRandom() method to include an option to use multiget. By default, keys_per_multiget=100. Preliminary tests imply that multiget is at least 1.25x faster per key than regular get. Will continue adding Multiget for ReadMissing, ReadHot, RandomWithVerify, ReadRandomWriteRandom; soon. Will also think about ways to better verify benchmarks. Test Plan: 1. make db_bench 2. ./db_bench --benchmarks=fillrandom 3. ./db_bench --benchmarks=readrandom --use_existing_db=1 --use_multiget=1 --threads=4 --keys_per_multiget=100 4. ./db_bench --benchmarks=readrandom --use_existing_db=1 --threads=4 5. Verify ops/sec (and 1000000 of 1000000 keys found) Reviewers: haobo, MarkCallaghan, dhruba Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11127	2013-06-12 12:42:21 -07:00
Haobo Xu	bdf1085944	[RocksDB] cleanup EnvOptions Summary: This diff simplifies EnvOptions by treating it as POD, similar to Options. - virtual functions are removed and member fields are accessed directly. - StorageOptions is removed. - Options.allow_readahead and Options.allow_readahead_compactions are deprecated. - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11175	2013-06-12 11:17:19 -07:00
Mayank Agarwal	184343a061	Max_mem_compaction_level can have maximum value of num_levels-1 Summary: Without this files could be written out to a level greater than the maximum level possible and is the source of the segfaults that wormhole awas getting. The sequence of steps that was followed: 1. WriteLevel0Table was called when memtable was to be flushed for a file. 2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed. 3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0. The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct. Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults and found that the same segfault occurs without this change and not with this. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11157	2013-06-09 10:38:55 -07:00
Deon Nicholas	d8c7c45ea0	Very basic Multiget and simple test cases. Summary: Implemented the MultiGet operator which takes in a list of keys and returns their associated values. Currently uses std::vector as its container data structure. Otherwise, it works identically to "Get". Test Plan: 1. make db_test ; compile it 2. ./db_test ; test it 3. make all check ; regress / run all tests 4. make release ; (optional) compile with release settings Reviewers: haobo, MarkCallaghan, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10875	2013-06-05 11:22:38 -07:00
Abhishek Kona	d91b42ee27	[Rocksdb] Measure all FSYNC/SYNC times Summary: Add stop watches around all sync calls. Test Plan: db_bench check if respective histograms are printed Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11073	2013-06-05 11:06:21 -07:00
Mark Callaghan	d9f538e1a9	Improve output for GetProperty('leveldb.stats') Summary: Display separate values for read, write & total compaction IO. Display compaction amplification and write amplification. Add similar values for the period since the last call to GetProperty. Results since the server started are reported as "cumulative" stats. Results since the last call to GetProperty are reported as "interval" stats. Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count Ln-stall ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 7 13 21 0 211 0 0 211 0.0 0.0 10.1 0 0 0 0 113 0.0 1 79 157 88 993 989 198 795 194 9.0 11.3 11.2 106 405 502 97 14 0.0 2 19 36 5 63 63 37 27 36 2.4 12.3 12.2 19 14 32 18 12 0.0 >>>>>>>>>>>>>>>>>>>>>>>>> text below has been is new and/or reformatted Uptime(secs): 122.2 total, 0.9 interval Compaction IO cumulative (GB): 0.21 new, 1.03 read, 1.23 write, 2.26 read+write Compaction IO cumulative (MB/sec): 1.7 new, 8.6 read, 10.3 write, 19.0 read+write Amplification cumulative: 6.0 write, 11.0 compaction Compaction IO interval (MB): 5.59 new, 0.00 read, 5.59 write, 5.59 read+write Compaction IO interval (MB/sec): 6.5 new, 0.0 read, 6.5 write, 6.5 read+write Amplification interval: 1.0 write, 1.0 compaction >>>>>>>>>>>>>>>>>>>>>>>> text above is new and/or reformatted Stalls(secs): 90.574 level0_slowdown, 0.000 level0_numfiles, 10.165 memtable_compaction, 0.000 leveln_slowdown Task ID: # Blame Rev: Test Plan: make check, run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11049	2013-06-03 20:24:49 -07:00
Haobo Xu	2b1fb5b01d	[RocksDB] Add score column to leveldb.stats Summary: Added the 'score' column to the compaction stats output, which shows the level total size devided by level target size. Could be useful when monitoring compaction decisions... Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb, MarkCallaghan Differential Revision: https://reviews.facebook.net/D11025	2013-06-03 17:38:27 -07:00
Haobo Xu	d897d33bf1	[RocksDB] Introduce Fast Mutex option Summary: This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now. Quickly tested 8 thread cpu bound 100 byte random read. No fast mutex: ~750k/s ops With fast mutex: ~880k/s ops Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D11031	2013-06-01 23:11:34 -07:00
Haobo Xu	2df65c118c	[RocksDB] Dump counters and histogram data periodically with compaction stats Summary: As title Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10995	2013-05-29 12:00:18 -07:00
heyongqiang	4b29651206	add block deviation option to terminate a block before it exceeds block_size Summary: a new option block_size_deviation is added. Test Plan: run db_test and db_bench Reviewers: dhruba, haobo Reviewed By: haobo Differential Revision: https://reviews.facebook.net/D10821	2013-05-24 15:52:49 -07:00
Haobo Xu	ef15b9d178	[RocksDB] Fix MaybeDumpStats Summary: MaybeDumpStats was causing lock problem Test Plan: make check; db_stress Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10935	2013-05-24 15:43:16 -07:00
Haobo Xu	0e879c93de	[RocksDB] dump leveldb.stats periodically in LOG file. Summary: Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis. By defauly, it's set to a very big number 3600 (1 hour). Test Plan: make check; Reviewers: dhruba Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10761	2013-05-23 16:56:59 -07:00
Dhruba Borthakur	26541860d3	The max size of the write buffer size can be 64 GB. Summary: There was an artifical limit on the size of the write buffer size. Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10911	2013-05-23 15:00:27 -07:00
Haobo Xu	87d0af15d8	[RocksDB] Introduce an option to skip log error on recovery Summary: Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery. If client is ok with losing most recent updates, we could simply skip those errors. However, it's important to introduce an additional flag, so that paranoid_check can still guard against more serious problems. Test Plan: make check; db_stress Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb, emayanke Differential Revision: https://reviews.facebook.net/D10869	2013-05-21 14:30:36 -07:00
Abhishek Kona	e1174306c5	[RocksDB] Simplify StopWatch implementation Summary: Make stop watch a simple implementation, instead of subclass of a virtual class Allocate stop watches off the stack instead of heap. Code is more terse now. Test Plan: make all check, db_bench with --statistics=1 Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D10809	2013-05-17 10:55:34 -07:00
Haobo Xu	4ca3c67bd3	[RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer. Summary: This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface. The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10773	2013-05-13 14:06:10 -07:00
Haobo Xu	73c0a33346	[RocksDB] fix compaction filter trigger condition Summary: Currently, compaction filter is run on internal key older than the oldest snapshot, which is incorrect. Compaction filter should really be run on the most recent internal key when there is no external snapshot. Test Plan: make check; db_stress Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D10641	2013-05-13 12:33:02 -07:00
Abhishek Kona	8d58ecdc29	[RocksDB] Expose compaction stalls via db_statistics Test Plan: make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10575	2013-05-10 14:41:45 -07:00
Abhishek Kona	988c20b9f7	[RocksDB] Clear Archive WAL files Summary: WAL files are moved to archive directory and clear only at DB::Open. Can lead to a lot of space consumption in a Database. Added logic to periodically clear Archive Directory too. Test Plan: make all check + add unit test Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D10617	2013-05-06 11:41:01 -07:00
Haobo Xu	05e8854085	[Rocksdb] Support Merge operation in rocksdb Summary: This diff introduces a new Merge operation into rocksdb. The purpose of this review is mostly getting feedback from the team (everyone please) on the design. Please focus on the four files under include/leveldb/, as they spell the client visible interface change. include/leveldb/db.h include/leveldb/merge_operator.h include/leveldb/options.h include/leveldb/write_batch.h Please go over local/my_test.cc carefully, as it is a concerete use case. Please also review the impelmentation files to see if the straw man implementation makes sense. Note that, the diff does pass all make check and truly supports forward iterator over db and a version of Get that's based on iterator. Future work: - Integration with compaction - A raw Get implementation I am working on a wiki that explains the design and implementation choices, but coding comes just naturally and I think it might be a good idea to share the code earlier. The code is heavily commented. Test Plan: run all local tests Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9651	2013-05-03 16:59:02 -07:00
Haobo Xu	eb6d139666	[RocksDB] Move table.h to table/ Summary: - don't see a point exposing table.h to the public. - fixed make clean to remove also *.d files. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10479	2013-04-22 16:07:56 -07:00
Haobo Xu	b4243e5a3d	[RocksDB] CompactionFilter cleanup Summary: - removed the compaction_filter_value from the callback interface. Restrict compaction filter to purging values. - modify some comments to reflect curent status. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10335	2013-04-20 10:26:51 -07:00
Abhishek Kona	7c6c3c0ff4	[Rockdsdb] Better Error messages. Closing db instead of deleting db Summary: A better error message. A local change. Did not look at other places where this could be done. Test Plan: compile Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10251	2013-04-15 15:27:15 -07:00
Haobo Xu	013e9ebbf1	[RocksDB] [Performance] Speed up FindObsoleteFiles Summary: FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior. Didn't profile anything, but several things could be improved: 1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree). You also don't know how many dynamic allocations occur just for building up this tree. switched to std::vector, also added logic to pre-calculate total size and do just one allocation 2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where mutex could be unlocked. 3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles I have a feeling this should pretty much fix it. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang, MarkCallaghan Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10197	2013-04-12 11:29:27 -07:00
Mayank Agarwal	94d86b25a9	Fix memory leak for probableWALfiles in db_impl.cc Summary: using unique_ptr to have automatic delete for probableWALfiles in db_impl.cc Test Plan: make Reviewers: sheki, dhruba Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D10083	2013-04-11 16:57:15 -07:00
Dhruba Borthakur	7730587120	Prevent segfault in OpenCompactionOutputFile Summary: The segfault was happening because the program was unable to open a new sst file (as part of the compaction) because the process ran out of file descriptors. The fix is to check the return status of the file creation before taking any other action. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fabf03f9700 (LWP 29904)] leveldb::DBImpl::OpenCompactionOutputFile (this=this@entry=0x7fabf9011400, compact=compact@entry=0x7fabf741a2b0) at db/db_impl.cc:1399 1399 db/db_impl.cc: No such file or directory. (gdb) where Test Plan: make check Reviewers: MarkCallaghan, sheki Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D10101	2013-04-10 09:59:48 -07:00
Abhishek Kona	574b76f710	[RocksDB][Bug] Look at all the files, not just the first file in TransactionLogIter as BatchWrites can leave it in Limbo Summary: Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the currentFile. The solution is if the last seq no. of the current file is < RequestedSeqNo. Assume the first seqNo. of the next file has to satisfy the request. Also major refactoring around the code. Moved opening the logreader to a seperate function, got rid of goto. Test Plan: added a unit test for it. Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb, emayanke Differential Revision: https://reviews.facebook.net/D10029	2013-04-08 16:28:09 -07:00
Abhishek Kona	ca789a10cc	[Rocksdb] Recover last updated sequence number from manifest also. Summary: During recovery, last_updated_manifest number was not set if there were no records in the Write-ahead log. Now check for the recovered manifest also and set last_updated_manifest file to the max value. Test Plan: unit test Reviewers: heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9891	2013-04-02 17:18:27 -07:00
Haobo Xu	6763110867	[RocksDB] Replace iterator based loop with range based loop for stl containers Summary: As title. Code is shorter and cleaner See https://our.dev.facebook.com/intern/tasks/?t=2233981 Test Plan: make check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D9789	2013-04-02 11:46:45 -07:00
Haobo Xu	645ff8f231	Let's get rid of delete as much as possible, here are some examples. Summary: If a class owns an object: - If the object can be null => use a unique_ptr. no delete - If the object can not be null => don't even need new, let alone delete - for runtime sized array => use vector, no delete. Test Plan: make check Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9783	2013-03-28 17:31:44 -07:00
Abhishek Kona	3b51605b8d	[RocksDB] Fix binary search while finding probable wal files Summary: RocksDB does a binary search to look at the files which might contain the requested sequence number at the call GetUpdatesSince. There was a bug in the binary search => when the file pointed by the middle index of bsearch was empty/corrupt it needst to resize the vector and update indexes. This now fixes that. Test Plan: existing unit tests pass. Reviewers: heyongqiang, dhruba Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9777	2013-03-28 13:37:15 -07:00
Abhishek Kona	8e9c781ae5	[Rocksdb] Fix Crash on finding a db with no log files. Error out instead Summary: If the vector returned by GetUpdatesSince is empty, it is still returned to the user. This causes it throw an std::range error. The probable file list is checked and it returns an IOError status instead of OK now. Test Plan: added a unit test. Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9771	2013-03-28 13:19:07 -07:00
Abhishek Kona	7fdd5f5b33	Use non-mmapd files for Write-Ahead Files Summary: Use non mmapd files for Write-Ahead log. Earlier use of MMaped files. made the log iterator read ahead and miss records. Now the reader and writer will point to the same physical location. There is no perf regression : ./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2 with This diff : fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s without this dif : fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s Test Plan: unit test included Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9741	2013-03-28 13:13:35 -07:00
Haobo Xu	ecd8db0200	[RocksDB] Minimize Mutex protected code section in the critical path Summary: rocksdb uses a single global lock to protect in memory metadata. We should minimize the mutex protected code section to increase the effective parallelism of the program. See https://our.intern.facebook.com/intern/tasks/?t=2218928 Test Plan: make check db_bench Reviewers: dhruba, heyongqiang CC: zshao, leveldb Differential Revision: https://reviews.facebook.net/D9705	2013-03-26 22:42:26 -07:00
Dhruba Borthakur	d0798f67f4	Run compactions even if workload is readonly or read-mostly. Summary: The events that trigger compaction: * opening the database * Get -> only if seek compaction is not disabled and other checks are true * MakeRoomForWrite -> when memtable is full * BackgroundCall -> If the background thread is about to do a compaction run, it schedules a new background task to trigger a possible compaction. This will cause additional background threads to find and process other compactions that can run concurrently. Test Plan: ran db_bench with overwrite and readonly alternatively. Reviewers: sheki, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9579	2013-03-20 23:43:29 -07:00
Dhruba Borthakur	ad96563b79	Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. Summary: This patch allows an application to specify whether to use bufferedio, reads-via-mmaps and writes-via-mmaps per database. Earlier, there was a global static variable that was used to configure this functionality. The default setting remains the same (and is backward compatible): 1. use bufferedio 2. do not use mmaps for reads 3. use mmap for writes 4. use readaheads for reads needed for compaction I also added a parameter to db_bench to be able to explicitly specify whether to do readaheads for compactions or not. Test Plan: make check Reviewers: sheki, heyongqiang, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9429	2013-03-20 23:14:03 -07:00
Mayank Agarwal	487168cdcf	Fixed sign-comparison in rocksdb code-base and fixed Makefile Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base Test Plan: make Reviewers: chip, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D9531	2013-03-19 14:35:23 -07:00
Mark Callaghan	72d14eafd3	add --benchmarks=levelstats option to db_bench, prevent "nan" in stats output Summary: Add --benchmarks=levelstats option to report per-level stats (#files, #bytes) Change readwhilewriting test to report response time for writes but exclude them from the stats merged by all threads. Prevent "NaN" in stats output by preventing division by 0. Remove "o" file I committed by mistake. Task ID: # Blame Rev: Test Plan: make check Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9513	2013-03-19 13:14:44 -07:00
Abhishek Kona	02c459805b	Ignore a zero-sized file while looking for a seq-no in GetUpdatesSince Summary: Rocksdb can create 0 sized log files when it is opened and closed without any operations. The GetUpdatesSince fails currently if there is a log file of size zero. This diff fixes this. If there is a log file is 0, it is removed form the probable_file_list Test Plan: unit test Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9507	2013-03-19 11:00:09 -07:00
Dhruba Borthakur	6d812b6afb	A mechanism to detect manifest file write errors and put db in readonly mode. Summary: If there is an error while writing an edit to the manifest file, the manifest file is closed and reopened to check if the edit made it in. However, if the re-opening of the manifest is unsuccessful and options.paranoid_checks is set t true, then the db refuses to accept new puts, effectively putting the db in readonly mode. In a future diff, I would like to make the default value of paranoid_check to true. Test Plan: make check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9201	2013-03-07 09:45:49 -08:00
Abhishek Kona	d68880a1b9	Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file Summary: Store the last flushed, seq no. in db_impl. Check against it in transaction Log iterator. Do not attempt to read ahead if we do not know if the data is flushed completely. Does not work if flush is disabled. Any ideas on fixing that? * Minor change, iter->Next is called the first time automatically for * the first time. Test Plan: existing test pass. More ideas on testing this? Planning to run some stress test. Reviewers: dhruba, heyongqiang CC: leveldb Differential Revision: https://reviews.facebook.net/D9087	2013-03-06 14:05:53 -08:00
Dhruba Borthakur	afed60938f	Fox db_stress crash by copying keys before changing sequencenum to zero. Summary: The compaction process zeros out sequence numbers if the output is part of the bottommost level. The Slice is supposed to refer to an immutable data buffer. The merger that implements the priority queue while reading kvs as the input of a compaction run reies on this fact. The bug was that were updating the sequence number of a record in-place and that was causing suceeding invocations of the merger to return kvs in arbitrary order of sequence numbers. The fix is to copy the key to a local memory buffer before setting its seqno to 0. Test Plan: Set Options.purge_redundant_kvs_while_flush = false and then run db_stress --ops_per_thread=1000 --max_key=320 Reviewers: emayanke, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9147	2013-03-06 10:52:08 -08:00
Mark Callaghan	993543d1be	Add rate_delay_limit_milliseconds Summary: This adds the rate_delay_limit_milliseconds option to make the delay configurable in MakeRoomForWrite when the max compaction score is too high. This delay is called the Ln slowdown. This change also counts the Ln slowdown per level to make it possible to see where the stalls occur. From IO-bound performance testing, the Level N stalls occur: * with compression -> at the largest uncompressed level. This makes sense because compaction for compressed levels is much slower. When Lx is uncompressed and Lx+1 is compressed then files pile up at Lx because the (Lx,Lx+1)->Lx+1 compaction process is the first to be slowed by compression. * without compression -> at level 1 Task ID: #1832108 Blame Rev: Test Plan: run with real data, added test Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9045	2013-03-04 07:41:15 -08:00
Dhruba Borthakur	806e264350	Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0. Summary: Rocks accumulates recent writes and deletes in the in-memory memtable. When the memtable is full, it writes the contents on the memtable to a file in L0. This patch removes redundant records at the time of the flush. If there are multiple versions of the same key in the memtable, then only the most recent one is dumped into the output file. The purging of redundant records occur only if the most recent snapshot is earlier than the earliest record in the memtable. Should we switch on this feature by default or should we keep this feature turned off in the default settings? Test Plan: Added test case to db_test.cc Reviewers: sheki, vamsi, emayanke, heyongqiang Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D8991	2013-03-04 00:01:47 -08:00
Dhruba Borthakur	e45c7a8444	Abilty to support upto a million .sst files in the database Summary: There was an artifical limit of 50K files per database. This is insifficient if the database is 1 TB in size and each file is 2 MB. Test Plan: make check Reviewers: sheki, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D8919	2013-02-26 16:27:51 -08:00
Abhishek Kona	959337ed5b	Measure compaction time. Summary: just record time consumed in compaction Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8781	2013-02-22 11:38:40 -08:00
Abhishek Kona	ec77366e14	Counters for bytes written and read. Summary: * Counters for bytes read and write. as a part of this diff, I want to=> * Measure compaction times. @dhruba can you point which function, should * I time to get Compaction-times. Was looking at CompactRange. Test Plan: db_test Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D8763	2013-02-21 16:06:32 -08:00
Abhishek Kona	fe10200ddc	Introduce histogram in statistics.h Summary: * Introduce is histogram in statistics.h * stop watch to measure time. * introduce two timers as a poc. Replaced NULL with nullptr to fight some lint errors Should be useful for google. Test Plan: ran db_bench and check stats. make all check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8637	2013-02-20 10:43:32 -08:00
amayank	f3901e0647	Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value" This reverts commit `4c696ed001`.	2013-02-18 22:32:27 -08:00
Dhruba Borthakur	4564915446	Zero out redundant sequence numbers for kvs to increase compression efficiency Summary: The sequence numbers in each record eat up plenty of space on storage. The optimization zeroes out sequence numbers on kvs in the Lmax layer that are earlier than the earliest snapshot. Test Plan: Unit test attached. Differential Revision: https://reviews.facebook.net/D8619	2013-02-18 21:51:15 -08:00
amayank	4c696ed001	Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value Summary: flush_on_destroy has a default value of false and the memtable is flushed in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that the destructor is called(db is closed) and the cases where we work with the memtable only are very less it is a good idea to give this a default value of true. Thus the put from ldb wil have its data flushed to disk in the destructor and the next Get will be able to read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when the db was opened in the normal Open mode is that the Get from normal Open first reads the memtable and directly finds the latest value written there and the Get from OpenForReadOnly doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled Test Plan: make all; ldb put and get and scans Reviewers: dhruba, heyongqiang, sheki Reviewed By: heyongqiang CC: kosievdmerwe, zshao, dilipj, kailiu Differential Revision: https://reviews.facebook.net/D8631	2013-02-15 16:56:06 -08:00
Kai Liu	b63aafce42	Allow the logs to be purged by TTL. Summary: * Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the development of such classes. * Refactor the existing AutoSplitLogger and fix several bugs. Test Plan: * Added a unit tests for different types of "auto splitable" loggers individually. * Tested the composited logger which allows the log files to be splitted by both TTL and log size. Reviewers: heyongqiang, dhruba Reviewed By: heyongqiang CC: zshao, leveldb Differential Revision: https://reviews.facebook.net/D8037	2013-02-04 19:42:40 -08:00
Chip Turner	0b83a83191	Fix poor error on num_levels mismatch and few other minor improvements Summary: Previously, if you opened a db with num_levels set lower than the database, you received the unhelpful message "Corruption: VersionEdit: new-file entry." Now you get a more verbose message describing the issue. Also, fix handling of compression_levels (both the run-over-the-end issue and the memory management of it). Lastly, unique_ptr'ify a couple of minor calls. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8151	2013-01-25 15:37:26 -08:00
Chip Turner	772f75b3fb	Stop continually re-creating build_version.c Summary: We continually rebuilt build_version.c because we put the current date into it, but that's what __DATE__ already is. This makes builds faster. This also fixes an issue with 'make clean FOO' not working properly. Also tweak the build rules to be more consistent, always have warnings, and add a 'make release' rule to handle flags for release builds. Test Plan: make, make clean Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D8139	2013-01-24 17:51:39 -08:00
Chip Turner	3dafdfb2c4	Use fallocate to prevent excessive allocation of sst files and logs Summary: On some filesystems, pre-allocation can be a considerable amount of space. xfs in our production environment pre-allocates by 1GB, for instance. By using fallocate to inform the kernel of our expected file sizes, we eliminate this wasteage (that isn't recovered until the file is closed which, in the case of LOG files, can be a considerable amount of time). Test Plan: created an xfs loopback filesystem, mounted with allocsize=4M, and ran db_stress. LOG file without this change was 4M, and with it it was 128k then grew to normal size. Reviewers: dhruba Reviewed By: dhruba CC: adsharma, leveldb Differential Revision: https://reviews.facebook.net/D7953	2013-01-24 12:25:13 -08:00
Chip Turner	2fdf91a4f8	Fix a number of object lifetime/ownership issues Summary: Replace manual memory management with std::unique_ptr in a number of places; not exhaustive, but this fixes a few leaks with file handles as well as clarifies semantics of the ownership of file handles with log classes. Test Plan: db_stress, make check Reviewers: dhruba Reviewed By: dhruba CC: zshao, leveldb, heyongqiang Differential Revision: https://reviews.facebook.net/D8043	2013-01-23 16:54:11 -08:00
Abhishek Kona	16903c35b0	Add counters to count gets and writes Summary: Add Tickers to count Write's and Get's Test Plan: make check Reviewers: dhruba, chip Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7977	2013-01-17 12:27:56 -08:00
Kosie van der Merwe	3c3df7402f	Fixed issues Valgrind found. Summary: Found issues with `db_test` and `db_stress` when running valgrind. `DBImpl` had an issue where if an compaction failed then it will use the uninitialised file size of an output file is used. This manifested as the final call to output to the log in `DoCompactionWork()` branching on uninitialized memory (all the way down in printf's innards). Test Plan: Ran `valgrind --track_origins=yes ./db_test` and `valgrind ./db_stress` to see if issues disappeared. Ran `make check` to see if there were no regressions. Reviewers: vamsi, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8001	2013-01-17 10:04:45 -08:00
Abhishek Kona	7d5a4383bb	rollover manifest file. Summary: Check in LogAndApply if the file size is more than the limit set in Options. Things to consider : will this be expensive? Test Plan: make all check. Inputs on a new unit test? Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7701	2013-01-16 12:09:44 -08:00
Chip Turner	c0cb289d57	Various build cleanups/improvements Summary: Specific changes: 1) Turn on -Werror so all warnings are errors 2) Fix some warnings the above now complains about 3) Add proper dependency support so changing a .h file forces a .c file to rebuild 4) Automatically use fbcode gcc on any internal machine rather than whatever system compiler is laying around 5) Fix jemalloc to once again be used in the builds (seemed like it wasn't being?) 6) Fix issue where 'git' would fail in build_detect_version because of LD_LIBRARY_PATH being set in the third-party build system Test Plan: make, make check, make clean, touch a header file, make sure rebuild is expected Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7887	2013-01-14 18:40:22 -08:00
Kosie van der Merwe	d8371ef1f6	Fixing some issues Valgrind found Summary: Found some issues running Valgrind on `db_test` (there are still some outstanding ones) and fixed them. Test Plan: make check ran `valgrind ./db_test` and saw that errors no longer occur Reviewers: dhruba, vamsi, emayanke, sheki Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7803	2013-01-08 12:16:40 -08:00
Kosie van der Merwe	d6e873f22f	Added clearer error message for failure to create db directory in DBImpl::Recover() Summary: Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error. Fixed CreateDirIfMissing() and added Env.DirExists() Test Plan: make check to test for regessions Ran the following to test if the error message is not about lock files not existing ./db_bench --db=dir/testdb After creating a file "testdb", ran the following to see if it failed with sane error message: ./db_bench --db=testdb Reviewers: dhruba, emayanke, vamsi, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D7707	2013-01-07 10:11:18 -08:00
Dhruba Borthakur	f4c2b7cf97	Enhance ReadOnly mode to process the all committed transactions. Summary: Leveldb has an api OpenForReadOnly() that opens the database in readonly mode. This call had an option to not process the transaction log. This patch removes this option and always processes all transactions that had been committed. It has been done in such a way that it does not create/write to any new files in the process. The invariant of "no-writes" to the leveldb data directory is still true. This enhancement allows multiple threads to open the same database in readonly mode and access all trancations that were committed right upto the OpenForReadOnly call. I changed the public API to match the new semantics because there are no users who are currently using this api. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7479	2012-12-19 16:30:46 -08:00
Dhruba Borthakur	3d1e92b05a	Enhancements to rocksdb for better support for replication. Summary: 1. The OpenForReadOnly() call should not lock the db. This is useful so that multiple processes can open the same database concurrently for reading. 2. GetUpdatesSince should not error out if the archive directory does not exist. 3. A new constructor for WriteBatch that can takes a serialized string as a parameter of the constructor. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7449	2012-12-17 11:40:19 -08:00
Kosie van der Merwe	62d48571de	Added meta-database support. Summary: Added kMetaDatabase for meta-databases in db/filename.h along with supporting fuctions. Fixed switch in DBImpl so that it also handles kMetaDatabase. Fixed DestroyDB() that it can handle destroying meta-databases. Test Plan: make check Reviewers: sheki, emayanke, vamsi, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7245	2012-12-17 11:26:59 -08:00
Abhishek Kona	2f0585fb97	Fix a bug. Where DestroyDB deletes a non-existant archive directory. Summary: C tests would fail sometimes as DestroyDB would return a Failure Status message when deleting an archival directory which was not created (WAL_ttl_seconds = 0). Fix: Ignore the Status returned on Deleting Archival Directory. Test Plan: * make check Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7395	2012-12-17 10:25:26 -08:00
Abhishek Kona	22c283625b	Fix Bug in Binary Search for files containing a seq no. and delete Archived Log Files during Destroy DB. Summary: * Fixed implementation bug in Binary_Searvch introduced in https://reviews.facebook.net/D7119 * Binary search is also overflow safe. * Delete archive log files and archive dir during DestroyDB Test Plan: make check Reviewers: dhruba CC: kosievdmerwe, emayanke Differential Revision: https://reviews.facebook.net/D7263	2012-12-11 16:15:02 -08:00
Dhruba Borthakur	24fc379273	An public api to fetch the latest transaction id. Summary: Implement a interface to retrieve the most current transaction id from the database. Test Plan: Added unit test. Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D7269	2012-12-10 16:04:19 -08:00
Abhishek Kona	1c6742e32f	Refactor GetArchivalDirectoryName to filename.h Summary: filename.h has functions to do similar things. Moving code away from db_impl.cc Test Plan: make check Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7251	2012-12-10 10:51:07 -08:00
Abhishek Kona	8055008909	GetUpdatesSince API to enable replication. Summary: How it works: * GetUpdatesSince takes a SequenceNumber. * A LogFile with the first SequenceNumber nearest and lesser than the requested Sequence Number is found. * Seek in the logFile till the requested SeqNumber is found. * Return an iterator which contains logic to return record's one by one. Test Plan: * Test case included to check the good code path. * Will update with more test-cases. * Feedback required on test-cases. Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7119	2012-12-07 11:42:13 -08:00
Dhruba Borthakur	c847a31727	Print compaction score for every compaction run. Summary: A compaction is picked based on its score. It is useful to print the compaction score in the LOG because it aids in debugging. If one looks at the logs, one can find out why a compaction was preferred over another. Test Plan: make clean check Differential Revision: https://reviews.facebook.net/D7137	2012-12-04 10:03:47 -08:00
sheki	d4627e6de4	Move WAL files to archive directory, instead of deleting. Summary: Create a directory "archive" in the DB directory. During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory, instead of deleting. Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move. Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6975	2012-11-28 17:28:08 -08:00
Abhishek Kona	d29f181923	Fix all the lint errors. Summary: Scripted and removed all trailing spaces and converted all tabs to spaces. Also fixed other lint errors. All lint errors from this point of time should be taken seriously. Test Plan: make all check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7059	2012-11-28 17:18:41 -08:00
Dhruba Borthakur	9a357847eb	Delete non-visible keys during a compaction even in the presense of snapshots. Summary: LevelDB should delete almost-new keys when a long-open snapshot exists. The previous behavior is to keep all versions that were created after the oldest open snapshot. This can lead to database size bloat for high-update workloads when there are long-open snapshots and long-open snapshot will be used for logical backup. By "almost new" I mean that the key was updated more than once after the oldest snapshot. If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if we find two instances of the same key k1 that lie entirely within s1 and s2 (i.e. s1 < k1 < s2), then the earlier version of k1 can be safely deleted because that version is not visible in any snapshot. Test Plan: unit test attached make clean check Differential Revision: https://reviews.facebook.net/D6999	2012-11-28 15:47:40 -08:00
Dhruba Borthakur	3366eda839	Print out status at the end of a compaction run. Summary: Print out status at the end of a compaction run. This helps in debugging. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki Differential Revision: https://reviews.facebook.net/D7035	2012-11-27 22:17:38 -08:00
sheki	43f5a07989	Remove unused varibles. Cause compiler warnings. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: emayanke Differential Revision: https://reviews.facebook.net/D6993	2012-11-26 20:55:24 -08:00
Dhruba Borthakur	fbb73a4ac3	Support to disable background compactions on a database. Summary: This option is needed for fast bulk uploads. The goal is to load all the data into files in L0 without any interference from background compactions. Test Plan: make clean check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6849	2012-11-20 21:12:06 -08:00
Dhruba Borthakur	62e7583f94	enhance dbstress to simulate hard crash Summary: dbstress has an option to reopen the database. Make it such that the previous handle is not closed before we reopen, this simulates a situation similar to a process crash. Added new api to DMImpl to remove the lock file. Test Plan: run db_stress Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D6777	2012-11-18 23:16:17 -08:00
Dhruba Borthakur	6c5a4d646a	Merge branch 'master' into performance Conflicts: db/db_impl.h	2012-11-14 21:39:52 -08:00
Dhruba Borthakur	5d16e503a6	Improved CompactionFilter api: pass in a opaque argument to CompactionFilter invocation. Summary: There are applications that operate on multiple leveldb instances. These applications will like to pass in an opaque type for each leveldb instance and this type should be passed back to the application with every invocation of the CompactionFilter api. Test Plan: Enehanced unit test for opaque parameter to CompactionFilter. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, sheki, emayanke Differential Revision: https://reviews.facebook.net/D6711	2012-11-13 16:22:26 -08:00
Abhishek Kona	0f8e4721a5	Metrics: record compaction drop's and bloom filter effectiveness Summary: Record BloomFliter hits and drop off reasons during compaction. Test Plan: Unit tests work. Reviewers: dhruba, heyongqiang Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6591	2012-11-09 11:38:45 -08:00
heyongqiang	20d18a89a3	disable size compaction in ldb reduce_levels and added compression and file size parameter to it Summary: disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction, added --compression=none\|snappy\|zlib\|bzip2 and --file_size= per-file size to ldb reduce_levels command Test Plan: run ldb Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba CC: sheki, emayanke Differential Revision: https://reviews.facebook.net/D6597	2012-11-09 10:14:47 -08:00
Dhruba Borthakur	95dda37858	Move filesize-based-sorting to outside the Mutex Summary: When a new version is created, we sort all the files at every level based on their size. This is necessary because we want to compact the largest file first. The sorting takes quite a bit of CPU. Moved the sorting code to be outside the mutex. Also, the earlier code was sorting files at all levels but we do not need to sort the highest-number level because those files are never the cause of any compaction. To reduce sorting costs, we sort only the first few files in each level because it is likely that those are the only files in that level that will be picked for compaction. At steady state, I have seen that this patch increase throughout from 1500 writes/sec to 1700 writes/sec at the end of a 72 hour run. The cpu saving by not sorting the last level was not distinctive in this test run because there were only 100K files in the highest numbered level. I expect the cpu saving to be significant when the number of files is much higher. This is mostly an early preview and not ready for rigorous review. With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs. Test Plan: make check Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D6411	2012-11-07 15:39:44 -08:00
Dhruba Borthakur	8143062edd	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/version_set.cc util/options.cc	2012-11-07 15:11:37 -08:00
heyongqiang	3fcf533ed0	Add a readonly db Summary: as subject Test Plan: run db_bench readrandom Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6495	2012-11-07 14:19:48 -08:00
Abhishek Kona	4e413df3d0	Flush Data at object destruction if disableWal is used. Summary: Added a conditional flush in ~DBImpl to flush. There is still a chance of writes not being persisted if there is a crash (not a clean shutdown) before the DBImpl instance is destroyed. Test Plan: modified db_test to meet the new expectations. Reviewers: dhruba, heyongqiang Differential Revision: https://reviews.facebook.net/D6519	2012-11-06 15:04:42 -08:00
Dhruba Borthakur	aa42c66814	Fix all warnings generated by -Wall option to the compiler. Summary: The default compilation process now uses "-Wall" to compile. Fix all compilation error generated by gcc. Test Plan: make all check Reviewers: heyongqiang, emayanke, sheki Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6525	2012-11-06 14:07:31 -08:00
Dhruba Borthakur	5f91868cee	Merge branch 'master' into performance Conflicts: db/version_set.cc util/options.cc	2012-11-05 16:51:55 -08:00
Dhruba Borthakur	5273c81483	Ability to invoke application hook for every key during compaction. Summary: There are certain use-cases where the application intends to delete older keys aftre they have expired a certian time period. One option for those applications is to periodically scan the entire database and delete appropriate keys. A better way is to allow the application to hook into the compaction process. This patch allows the application to set a method callback for every key that is being compacted. If this method returns true, then the key is not preserved in the output of the compaction. Test Plan: This is mostly to preview the proposed new public api. Since it is a public api, please do due diligence on reviewing it. I will be writing test cases for this api in mynext version of this patch. Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang CC: sheki, adsharma Differential Revision: https://reviews.facebook.net/D6285	2012-11-05 16:02:13 -08:00
Dhruba Borthakur	81f735d97c	Merge branch 'master' into performance Conflicts: db/db_impl.cc util/options.cc	2012-11-05 09:41:38 -08:00
heyongqiang	3096fa7534	Add two more options: disable block cache and make table cache shard number configuable Summary: as subject Test Plan: run db_bench and db_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6111	2012-11-01 13:23:21 -07:00
Mark Callaghan	3e7e269292	Use timer to measure sleep rather than assume it is 1000 usecs Summary: This makes the stall timers in MakeRoomForWrite more accurate by timing the sleeps. From looking at the logs the real sleep times are usually about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are: 2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger 2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83 Task ID: # Blame Rev: Test Plan: run db_bench, look at DB/LOG Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6297	2012-10-30 07:21:37 -07:00
Dhruba Borthakur	53e04311b1	Merge branch 'master' into performance Conflicts: db/db_bench.cc util/options.cc	2012-10-29 14:18:00 -07:00
Dhruba Borthakur	321dfdc3ae	Allow having different compression algorithms on different levels. Summary: The leveldb API is enhanced to support different compression algorithms at different levels. This adds the option min_level_to_compress to db_bench that specifies the minimum level for which compression should be done when compression is enabled. This can be used to disable compression for levels 0 and 1 which are likely to suffer from stalls because of the CPU load for memtable flushes and (L0,L1) compaction. Level 0 is special as it gets frequent memtable flushes. Level 1 is special as it frequently gets all:all file compactions between it and level 0. But all other levels could be the same. For any level N where N > 1, the rate of sequential IO for that level should be the same. The last level is the exception because it might not be full and because files from it are not read to compact with the next larger level. The same amount of time will be spent doing compaction at any level N excluding N=0, 1 or the last level. By this standard all of those levels should use the same compression. The difference is that the loss (using more disk space) from a faster compression algorithm is less significant for N=2 than for N=3. So we might be willing to trade disk space for faster write rates with no compression for L0 and L1, snappy for L2, zlib for L3. Using a faster compression algorithm for the mid levels also allows us to reclaim some cpu without trading off much loss in disk space overhead. Also note that little is to be gained by compressing levels 0 and 1. For a 4-level tree they account for 10% of the data. For a 5-level tree they account for 1% of the data. With compression enabled: * memtable flush rate is ~18MB/second * (L0,L1) compaction rate is ~30MB/second With compression enabled but min_level_to_compress=2 * memtable flush rate is ~320MB/second * (L0,L1) compaction rate is ~560MB/second This practicaly takes the same code from https://reviews.facebook.net/D6225 but makes the leveldb api more general purpose with a few additional lines of code. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6261	2012-10-29 11:48:09 -07:00
Mark Callaghan	acc8567b24	Add more rates to db_bench output Summary: Adds the "MB/sec in" and "MB/sec out" to this line: Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out Changes all values to be reported per interval and since test start for this line: ... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6291	2012-10-29 11:30:07 -07:00
Mark Callaghan	70c42bf05f	Adds DB::GetNextCompaction and then uses that for rate limiting db_bench Summary: Adds a method that returns the score for the next level that most needs compaction. That method is then used by db_bench to rate limit threads. Threads are put to sleep at the end of each stats interval until the score is less than the limit. The limit is set via the --rate_limit=$double option. The specified value must be > 1.0. Also adds the option --stats_per_interval to enable additional metrics reported every stats interval. Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6243	2012-10-29 10:17:43 -07:00
Kai Liu	d50f8eb603	Enable LevelDb to create a new log file if current log file is too large. Summary: Enable LevelDb to create a new log file if current log file is too large. Test Plan: Write a script and manually check the generated info LOG. Task ID: 1803577 Blame Rev: Reviewers: dhruba, heyongqiang Reviewed By: heyongqiang CC: zshao Differential Revision: https://reviews.facebook.net/D6003	2012-10-26 14:55:02 -07:00
Mark Callaghan	65855dd8d4	Normalize compaction stats by time in compaction Summary: I used server uptime to compute per-level IO throughput rates. I intended to use time spent doing compaction at that level. This fixes that. Task ID: # Blame Rev: Test Plan: run db_bench, look at results Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6237	2012-10-26 14:19:13 -07:00
Dhruba Borthakur	ea9e087851	Merge branch 'master' into performance Conflicts: db/db_bench.cc db/db_impl.cc db/db_test.cc	2012-10-26 08:57:56 -07:00
Mark Callaghan	e7206f43ee	Improve statistics Summary: This adds more statistics to be reported by GetProperty("leveldb.stats"). The new stats include time spent waiting on stalls in MakeRoomForWrite. This also includes the total amplification rate where that is: (#bytes of sequential IO during compaction) / (#bytes from Put) This also includes a lot more data for the per-level compaction report. * Rn(MB) - MB read from level N during compaction between levels N and N+1 * Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1 * Wnew(MB) - new data written to the level during compaction * Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB) * Rn - files read from level N during compaction between levels N and N+1 * Rnp1 - files read from level N+1 during compaction between levels N and N+1 * Wnp1 - files written to level N+1 during compaction between levels N and N+1 * NewW - new files written to level N+1 during compaction * Count - number of compactions done for this level This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB) Compactions Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count ------------------------------------------------------------------------------------------------------------------------------------- 0 3 6 33 0 576 0 0 576 -1.0 0.0 1.3 0 0 0 0 290 1 127 242 351 5316 5314 570 4747 567 17.0 12.1 12.1 287 2399 2685 286 32 2 161 328 54 822 824 326 496 328 4.0 1.9 1.9 160 251 411 160 161 Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out Uptime(secs): 439.8 Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction Task ID: # Blame Rev: Test Plan: run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - (cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8) Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D6153	2012-10-24 14:21:38 -07:00
Dhruba Borthakur	4c107587ed	Delete files outside the mutex. Summary: The compaction process deletes a large number of files. This takes quite a bit of time and is best done outside the mutex lock. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6123	2012-10-22 11:53:23 -07:00

1 2 3 4 5 ...

304 Commits