Commit Graph

743 Commits

Author SHA1 Message Date
Lei Jin
d118707f8d set bg_error_ when background flush goes wrong
Summary: as title

Test Plan: unit test

Reviewers: haobo, igor, sdong, kailiu, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15435
2014-01-29 15:55:58 -08:00
Igor Canadi
fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Igor Canadi
4662969bf5 PurgeObsoleteFiles in DropColumnFamily
Summary: When we drop the column family, we want to delete all the files from that column family.

Test Plan: make check

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15561
2014-01-29 11:58:33 -08:00
Igor Canadi
f24a3ee52d Read from and write to different column families
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()

Test Plan: Added a unit test in column_family_test

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15537
2014-01-29 11:38:16 -08:00
Igor Canadi
c1071ed95c Merge branch 'master' into columnfamilies 2014-01-28 16:04:00 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Mark Callaghan
90f29ccbef Update monitoring to include average time per compaction and stall
Summary:
The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds.
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count   msComp   msStall  Ln-stall Stall-cnt
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        8       15   1.5         2         0        30         0         0        30        0.0       0.0        15.5        0        0        0        0       16      112       0.2       1.3      7568
  1        8       16   1.6         1        26        26        15        11        16        3.5      17.6        18.1        8        6       13        7        3      362       0.0       0.0         0
  2        1        2   0.0         0         0         2         0         0         2        0.0       0.0        18.4        0        0        0        0        1       50       0.0       0.0         0

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15345
2014-01-27 15:06:04 -08:00
Igor Canadi
511b03a5b5 LogAndApply to take ColumnFamilyData
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.

Test Plan: make check

Reviewers: dhruba, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15465
2014-01-27 13:57:58 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Igor Canadi
cf783c674a Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.h
2014-01-27 10:00:24 -08:00
Igor Canadi
6c2ca1d3e6 Move NeedsCompaction() from VersionSet to Version
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.

Test Plan: make check

Reviewers: kailiu, haobo, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15333
2014-01-27 09:59:00 -08:00
Igor Canadi
6a404de4ba Merge branch 'master' into columnfamilies 2014-01-24 15:54:18 -08:00
Igor Canadi
f653fdcf5a Fixing iterator cleanup for Tailing iterator
Immutable tailing iterator doesn't set CleanupState::mem, so we don't
have to unref it.
2014-01-24 15:51:06 -08:00
Igor Canadi
1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi
c583157d49 MemTableListVersion
Summary:
MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful.

This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team.

I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet())

Test Plan: `make check` works. I will also do `make valgrind_check` before commit

Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15255
2014-01-24 14:52:08 -08:00
Igor Canadi
28d1a0c6f5 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.h
	db/db_test.cc
	include/rocksdb/db.h
	include/utilities/stackable_db.h
2014-01-24 09:27:29 -08:00
Igor Canadi
09489d395f Fix a bug in DBImpl::CreateColumnFamily 2014-01-24 09:12:00 -08:00
Lei Jin
aba2acb5ec CompactRange() to return status
Summary: as title

Test Plan:
make all check
What else tests shall I cover?

Reviewers: igor, haobo

CC:

Differential Revision: https://reviews.facebook.net/D15339
2014-01-23 16:41:46 -08:00
Kai Liu
054c5dda8c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
	util/statistics_imp.h
2014-01-23 16:32:49 -08:00
Tomislav Novak
81c9cc9b3b Tailing iterator
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.

TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).

Test Plan:
* new unit tests
* running LD with this patch

Reviewers: igor, dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb, lovro, march

Differential Revision: https://reviews.facebook.net/D15285
2014-01-23 16:26:08 -08:00
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
f9a25dda9f Fix wrong merge 2014-01-22 11:30:19 -08:00
Igor Canadi
92a022ad07 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
	db/version_set.cc
2014-01-22 10:59:07 -08:00
Igor Canadi
6fe9b57748 Refactor Recover() code
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered

The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.

[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15237
2014-01-22 10:45:26 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Mark Callaghan
4e8321bfea Boost access before mutex is unlocked
Summary:
This moves the use of versions_ to before the mutex is unlocked
to avoid a possible race.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15279
2014-01-17 21:32:23 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
Igor Canadi
8079dd5d24 Merge branch 'master' into performance 2014-01-17 12:20:07 -08:00
Mark Callaghan
439e36db21 Fix SlowdownAmount
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall

With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.

With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown

Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15243
2014-01-17 10:15:30 -08:00
Mike Lin
b1194f4903 Minor compaction logging improvements
1) make summary less likely to be truncated
2) format human-readable file sizes in summary
3) log the motivation for each universal compaction
2014-01-16 22:54:31 -08:00
kailiu
1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
Igor Canadi
2f4eda7890 Move functions from VersionSet to Version
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183
2014-01-15 16:18:04 -08:00
Igor Canadi
65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
Siying Dong
9b51af5a17 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589

Conflicts:
	db/db_impl.cc
2014-01-14 17:41:44 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Igor Canadi
055e6df45b VersionEdit not to take NumLevels()
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.

Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.

This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15159
2014-01-14 15:27:09 -08:00
Igor Canadi
7d9f21cf23 BuildBatchGroup -- memcpy outside of lock
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.

Test Plan:
`make check`

I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D15063
2014-01-14 14:49:31 -08:00
Naman Gupta
1d9bac4d7f Use sanitized options while opening db
Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture.

Test Plan: make check

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14103
2014-01-14 11:46:24 -08:00
Siying Dong
9ea8bf90f1 DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 11:24:43 -08:00
Siying Dong
fbbf0d1456 Pre-calculate whether to slow down for too many level 0 files
Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141
2014-01-14 11:23:02 -08:00
Siying Dong
51dd21926c DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 10:53:16 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Igor Canadi
d076cef347 [column families] Get rid of VersionSet::current_ and keep current Version for each column family
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.

I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15105
2014-01-13 08:59:15 -08:00
Siying Dong
aa0ef6602d [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039
2014-01-10 15:57:49 -08:00
Siying Dong
237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
Siying Dong
5575316350 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703
2014-01-08 16:05:36 -08:00
Igor Canadi
19e3ee64ac Add column family information to WAL
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).

This endeavour required changing some WriteBatch internals.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15045
2014-01-08 12:53:33 -08:00
Mark Callaghan
50994bf699 Don't always compress L0 files written by memtable flush
Summary:
Code was always compressing L0 files written by a memtable flush
when compression was enabled. Now this is done when
min_level_to_compress=0 for leveled compaction and when
universal_compaction_size_percent=-1 for universal compaction.

Task ID: #3416472

Blame Rev:

Test Plan:
ran db_bench with compression options

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba, igor, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14757
2014-01-07 21:50:26 -08:00
Igor Canadi
72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi
d3a2ba9c64 Merge branch 'master' into columnfamilies 2014-01-07 11:05:03 -08:00
Igor Canadi
17a222670b Merge branch 'master' into performance 2014-01-07 11:04:21 -08:00
Tomislav Novak
9f690ec62c Fix a deadlock in CompactRange()
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):

   1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
      bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
      (bg_compaction_scheduled_ is now 10) and waits for it to complete.
   2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
      (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
      drop to LargeNumber.
   3. BG thread completes the first manual compaction, decrements
      bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
      Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
      (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
      thread #2 also goes to sleep (waiting on bg_cv_), without resetting
      bg_compaction_scheduled_.

This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).

Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.

Tests (make check) pass.

Reviewers: dhruba, igor, sdong, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14799
2014-01-07 10:37:34 -08:00
Igor Canadi
ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
Kai Liu
774ed89c24 Replace vector with autovector
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.

Test Plan:
make check

I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.

Reviewers: dhruba, haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14985
2014-01-02 16:43:35 -08:00
kailiu
e72aa37cc5 Merge branch 'master' into performance
Conflicts:
	db/table_cache.cc
2014-01-02 16:34:59 -08:00
kailiu
476416c27c Some minor refactoring on the code
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.

Test Plan: ran `make check`

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15009
2014-01-02 16:32:31 -08:00
Igor Canadi
6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
Igor Canadi
b60c14f6ee Support multi-threaded DisableFileDeletions() and EnableFileDeletions()
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().

However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.

Test Plan: make check

Reviewers: dhruba, haobo, sanketh

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14781
2014-01-02 03:33:42 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Igor Canadi
1fdb3f7dc6 [RocksDB] Optimize locking for Get
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679
2013-12-20 09:57:58 -08:00
Igor Canadi
9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
Siying Dong
a8b8b11dc4 Get() Does Not Reserve space for to_delete memtables
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14697
2013-12-17 17:16:16 -08:00
Mark Callaghan
e9e6b00d29 Add monitoring for universal compaction and add counters for compaction IO
Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583
2013-12-12 13:27:43 -08:00
Siying Dong
e8ab1934d9 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589
2013-12-12 11:30:00 -08:00
Siying Dong
aaf9c6203c [RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14595
2013-12-11 13:59:43 -08:00
Siying Dong
a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
Siying Dong
0304e3d2ff When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577

Conflicts:
	db/db_impl.cc
2013-12-11 10:02:17 -08:00
Siying Dong
95a411d853 When flushing mem tables, create iterators out of mutex
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577
2013-12-11 09:57:19 -08:00
kailiu
a82f42b765 rename db/memtablelist.{h,cc} 2013-12-10 19:03:13 -08:00
Igor Canadi
204bb9cffd Get rid of LogFlush() in InternalIterator 2013-12-10 10:59:00 -08:00
Igor Canadi
19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00
Igor Canadi
fb9fce4fc3 [RocksDB] BackupableDB
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

   backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
               0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
               files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
               files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediatelly when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295
2013-12-09 14:06:52 -08:00
kailiu
551e9428ce Merge branch 'master' into performance 2013-12-06 14:15:42 -08:00
Siying Dong
ef2211a9ca [RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415
2013-12-06 10:28:59 -08:00
Mayank Agarwal
18802689b8 Make an API to get database identity from the IDENTITY file
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)

Test Plan: db/db_test (has identity checks)

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14463
2013-12-04 22:39:17 -08:00
Sajal Jain
28a1b9b95f [rocksdb] statistics counters for memtable hits and misses
Summary:
added counters
rocksdb.memtable.hit - for memtable hit
rocksdb.memtable.miss - for memtable miss

Test Plan: db_bench tests

Reviewers: igor, dhruba, haobo

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D14433
2013-12-03 12:59:53 -08:00
Igor Canadi
eb12e47e0e Killing Transform Rep
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.

This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.

I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14397
2013-12-03 12:42:15 -08:00
Igor Canadi
043fc14c3e Get rid of some shared_ptrs
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.

The ones that are left are accessed infrequently and I think we're fine with keeping them.

Test Plan: make asan_check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14427
2013-12-03 11:17:58 -08:00
Dhruba Borthakur
98968ba937 Free obsolete memtables outside the dbmutex had a memory leak.
Summary:
The commit at 27bbef1180 had a memory leak
that was detected by valgrind. The memtable that has a refcount decrement
in MemTableList::InstallMemtableFlushResults was not freed.

Test Plan: valgrind ./db_test --leak-check=full

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14391
2013-11-28 10:25:22 -08:00
Kai Liu
1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
Dhruba Borthakur
27bbef1180 Free obsolete memtables outside the dbmutex.
Summary:
Large memory allocations and frees are costly and best done outside the
db-mutex. The memtables are already allocated outside the db-mutex but
they were being freed while holding the db-mutex.
This patch frees obsolete memtables outside the db-mutex.

Test Plan:
make check
db_stress

Unit tests pass, I am in the process of running stress tests.

Reviewers: haobo, igor, emayanke

Reviewed By: haobo

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14319
2013-11-25 21:04:48 -08:00
Igor Canadi
3ce3658411 DB::GetOptions()
Summary: We need access to options for BackupableDB

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14331
2013-11-25 15:51:50 -08:00
Igor Canadi
11c26bd4a4 [RocksDB] Interface changes required for BackupableDB
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review

Test Plan: make asan_check

Reviewers: dhruba, haobo, emayanke

Reviewed By: emayanke

CC: leveldb, kailiu, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14301
2013-11-25 12:39:23 -08:00
Dhruba Borthakur
299f5c76bb Create new log file outside the dbmutex.
Summary:
All filesystem Io should be done outside the dbmutex. There was one place
when we have to roll the transaction log that we were creating the new log file
while holding the dbmutex.

I rearranged this code so that the act of creating the new transaction log
file is done without holding the dbmutex.  I also allocate the new memtable
outside the dbmutex, this is important because creating the memtable
could be heavyweight.

Test Plan: make check and dbstress

Reviewers: haobo, igor

Reviewed By: haobo

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14283
2013-11-25 11:23:42 -08:00
Haobo Xu
5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
Siying Dong
3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong
b135d01e7b Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001

Conflicts:
	table/merger.cc
2013-11-21 17:39:19 -08:00
Siying Dong
3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
kailiu
6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
kailiu
1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing test. and the sample output of table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
Kai Liu
75df72f2a5 Change the logic in KeyMayExist()
Summary:
Previously in KeyMayExist(), if DB::Get() returns non-Status::OK(), we assumes key may not exist.
However, as if index block is not in block cache, Status::Incomplete() will return. Worse still, if
options::filter_delete is enabled, we may falsely ignore the "delete" operation:

  https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L217-L220

This diff fixes this bug and will let crash-test pass.

Test Plan:
Ran:

  ./db_stress --test_batches_snapshots=1 --ops_per_thread=1000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=1 --reopen=0 --readpercent=5 --prefixpercent=45 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/home/kailiu/local/newer --max_key=100000000 --disable_seek_compaction=0 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152
--target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=1

Previously we'll see crash happens very soon.

Reviewers: igor, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14115
2013-11-17 01:00:34 -08:00
Pascal Borreli
443e04e62d Fixed typos 2013-11-16 11:21:34 +00:00
Igor Canadi
29c931f70b Avoid populating live set if we don't need to
Summary: Also changed some comments

Test Plan: ./deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14091
2013-11-14 22:42:02 -08:00
Igor Canadi
a0ce3fd00a PurgeObsoleteFiles() unittest
Summary:
Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly.

Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions.

Test Plan: Created a unit test

Reviewers: dhruba, haobo, nkg-

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14079
2013-11-14 18:03:57 -08:00
Igor Canadi
fda8142f29 Delete log files in the correct dir
Summary: Log files are stored in wal_dir, not dbname_

Test Plan: deletfile_test

Reviewers: nkg-

Reviewed By: nkg-

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14067
2013-11-13 14:54:54 -08:00
Kai Liu
35460ccb53 Fix the string format issue
Summary:

mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable.

Test Plan:

make clean && make -j32
2013-11-12 21:05:39 -08:00
Igor Canadi
d88d8ecf80 Fix deleting files
Summary: One more fix! In some cases, our filenames start with "/". Apparently, env_ can't handle filenames with double //

Test Plan:
deletefile_test does not include this line in the LOG anymore:
2013/11/12-18:11:43.150149 7fe4a6fff700 RenameFile logfile #3 FAILED -- IO error: /tmp/rocksdbtest-3574/deletefile_test//000003.log: No such file or directory

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14055
2013-11-12 20:32:07 -08:00
kailiu
21587760b9 Fixing the warning messages captured under mac os # Consider using git commit -m 'One line title' && arc diff. # You will save time by running lint and unit in the background.
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..

Test Plan: ran make in mac os

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14049
2013-11-12 20:05:28 -08:00
Igor Canadi
9bc4a26f56 Small changes in Deleting obsolete files
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827

Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.

I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.

Test Plan: Ran deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14025
2013-11-12 11:53:26 -08:00
Igor Canadi
dad425562f Move the comment
Summary: Moving the comment per @haobo suggestion.

Test Plan: No

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14019
2013-11-12 10:07:55 -08:00
Igor Canadi
4abd219cfc Combine two FindObsoleteFiles()
Summary: We don't need to call FindObsoleteFiles() twice

Test Plan: deletefile_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14007
2013-11-11 21:41:32 -08:00
Igor Canadi
94e139f94d Fixing failed delete file test
Summary: FindObsoleteFiles() has to be called before PurgeObsoleteFiles() because FindObsoleteFiles() sets manifest_file_number, log_number and prev_log_number to valid values.

Test Plan: deletefile_test now works

Reviewers: dhruba, emayanke, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13995
2013-11-11 21:03:41 -08:00
Igor Canadi
1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Kai Liu
fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Igor Canadi
444cf88a56 Flush the log outside of lock
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.

We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.

Test Plan: db_test

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13935
2013-11-07 11:31:56 -08:00
Haobo Xu
fd2044883a [RocksDB] Generalize prefix-aware iterator to be used for more than one Seek
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.

Test Plan: test it on Leaf

Reviewers: dhruba, kailiu, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13929
2013-11-06 20:45:49 -08:00
shamdor
c2be2cba04 WAL log retention policy based on archive size.
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.

Test Plan:
1. Unit tests pass.
2. Benchmark.

Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13869
2013-11-06 18:46:28 -08:00
Mayank Agarwal
f837f5b1c9 Making the transaction log iterator more robust
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places

Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13689
2013-11-04 20:49:03 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Haobo Xu
8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Igor Canadi
100fa8e013 If a Put fails, fail all other puts
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.

I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?

Test Plan: Corruption unit test.

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13671
2013-10-28 12:36:02 -07:00
Mayank Agarwal
56305221c4 Unify DeleteFile and DeleteWalFiles
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff

Test Plan: Expanded delete_file test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13647
2013-10-25 08:32:14 -07:00
Kai Liu
c17607a251 Fix the log number bug when updating MANIFEST file
Summary:
Crash may occur during the flushes of more than two mem tables.

As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.

This diff fix the problem by setting the correct "log number" in MANIFEST.

Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13659
2013-10-24 21:05:33 -07:00
Haobo Xu
2fb361ad98 [RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log.
Summary: as title

Test Plan: make check; ./perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13629
2013-10-23 13:38:39 -07:00
Mayank Agarwal
9b50106f9a Dbid feature
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.

Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13587
2013-10-22 12:23:34 -07:00
Siying Dong
9edda37027 Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress
Summary:
This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases.

Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later.

Test Plan:
add test
DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13467
2013-10-17 13:33:39 -07:00
Dhruba Borthakur
9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Siying Dong
073cbfc8f0 Enable background flush thread by default and fix issues related to it
Summary:
Enable background flush thread in this patch and fix unit tests with:
(1) After background flush, schedule a background compaction if condition satisfied;
(2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered
(3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results.

Test Plan: pass all unit tests

Reviewers: haobo, xjin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13461
2013-10-16 13:32:53 -07:00
Mayank Agarwal
fe3713961e Features in Transaction log iterator
Summary:
* Logstore requests a valid change of reutrning an empty iterator and not an error in case of no log files.
* Changed the code to return the writebatch containing the sequence number requested from GetupdatesSince even if it lies in the middle. Earlier we used to return the next writebatch,. This also allows me oto guarantee that no files played upon by the iterator are redundant. I mean the starting log file has at least a sequence number >= the sequence number requested form GetupdatesSince.
* Cleaned up redundant logic in Iterator::Next and made a new function SeekToStartSequence for greater readability and maintainibilty.
* Modified a test in db_test accordingly
Please check the logic carefully and suggest improvements. I have a separate patch out for more improvements like restricting reader to read till written sequences.

Test Plan:
* transaction log iterator tests in db_test,
* db_repl_stress.
* rocks_log_iterator_test in fbcode/wormhole/rocksdb/test - 2 tests thriving on hacks till now can get simplified
* testing on the shadow setup for sigma with replication

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13437
2013-10-14 18:16:21 -07:00
Kai Liu
86ef6c3f74 Add statistics to sst file
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file.  It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.

This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.

After this diff, we will figure out the interface of how to allow user to collect their interested statistics.

Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13419
2013-10-14 15:56:13 -07:00
Siying Dong
88f2f89068 Change Function names from Compaction->Flush When they really mean Flush
Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance.

Test Plan: Run test suites.

Reviewers: haobo, dhruba, xjin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13431
2013-10-14 15:12:15 -07:00
Mayank Agarwal
a8b4a69de0 Fixing error in ParseFileName causing DestroyDB to fail on archive directory
Summary:
This careless error was causing ASSERT_OK(DestroyDB) to fail in db_test.
Basically .. was being returned as a child of db/archive and ParseFileName returned false on that,
but 'type' was set to LogFile from earlier and not reset. The return of ParseFileName was not being checked to delete the log file or not.

Test Plan: make all check

Reviewers: dhruba, haobo, xjin, kailiu, nkg-

Reviewed By: nkg-

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13413
2013-10-10 18:18:31 -07:00
Naman Gupta
cbf4a06427 Add option for storing transaction logs in a separate dir
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.

Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.

Reviewers: dhruba, haobo, leveldb

CC: xinyaohu, sumeet

Differential Revision: https://reviews.facebook.net/D13239
2013-10-08 17:40:27 -07:00
Dhruba Borthakur
4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Haobo Xu
bf89edf78b [RocksDB] Added a property "leveldb.num-immutable-mem-table" so that Flush can be called without blocking, and application still has a way to check when it's done also without blocking.
Summary: as title

Test Plan: DBTest.NumImmutableMemTable

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13305
2013-10-05 11:54:08 -07:00
Dhruba Borthakur
0a9f873f4b Removed scribe, thrift and java modules.
Summary: Removed scribe, thrift and java modules.

Test Plan:
make release
make check

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13293
2013-10-04 15:36:00 -07:00
Dhruba Borthakur
a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Mayank Agarwal
b3ed08129b Add a statistic to count the number of calls to GetUpdatesSince
Summary: This is useful to keep track of refreshes in transaction log iterator

Test Plan: make; db_stress --statistics=1 shows it

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13281
2013-10-04 10:47:20 -07:00
Haobo Xu
200c05a23f [RocksDB] Still honor DisableFileDeletions when purge_log_after_memtable_flush is on
Summary: as title

Test Plan: make check

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13263
2013-10-03 16:12:43 -07:00
Haobo Xu
fa798e9e28 [Rocksdb] Submit mem table flush job in a different thread pool
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .

Test Plan: Try it in shadow test.

Reviewers: dhruba, xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12933
2013-10-03 14:37:19 -07:00
Xing Jin
658a3ce2fa Fix SIGSEGV issue in universal compaction
Summary:
We saw SIGSEGV when set options.num_levels=1 in universal compaction
style. Dug into this issue for a while, and finally found the root cause (thank Haobo for discussion).

Test Plan: Add new unit test. It throws SIGSEGV without this change. Also run "make all check".

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13251
2013-10-02 17:33:31 -07:00
Xing Jin
8eb552bf4d New unit test for iterator with snapshot
Summary:
I played with the reported bug about iterator with snapshot:
https://code.google.com/p/leveldb/issues/detail?id=200.

I turned the original test program
(https://code.google.com/p/leveldb/issues/attachmentText?id=200&aid=2000000000&name=test.cc&token=7uOUQW-HFlbAFMUm7EqtaAEy7Tw%3A1378320724136)
into a new unit test, but I cannot reproduce the problem. Notice lines
31-34 in above link. I have ran the new test with and without such Put()
operations. Both succeed.

So this diff simply adds the test, without changing any source codes.

Test Plan: run new test.

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12735
2013-09-28 11:39:08 -07:00
Haobo Xu
0c4040681a [RocksDB] Move last_sequence and last_flushed_sequence_ update back into lock protected area
Summary: A previous diff moved these outside of lock protected area. Moved back in now. Also moved tmp_batch_ update outside of lock protected area, as only the single write thread can access it.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13137
2013-09-26 20:43:11 -07:00
Haobo Xu
0e422308aa [RocksDB] Remove Log file immediately after memtable flush
Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11709
2013-09-12 11:54:44 -07:00
Dhruba Borthakur
32c965d417 Flush was hanging because the configured options specified that more than 1 memtable need to be merged.
Summary:
There is an config option called Options.min_write_buffer_number_to_merge
that specifies the minimum number of write buffers to merge in memory
before flushing to a file in L0. But in the the case when the db is
being closed, we should not be using this config, instead we should
flush whatever write buffers were available at that time.

Test Plan: Unit test attached.

Reviewers: haobo, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12717
2013-09-06 16:28:33 -07:00
Mayank Agarwal
aa5c897d42 Return pathname relative to db dir in LogFile and cleanup AppendSortedWalsOfType
Summary: So that replication can just download from wherever LogFile.Pathname is pointing them.

Test Plan: make all check;./db_repl_stress

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12609
2013-09-04 13:44:43 -07:00
Xing Jin
42c109cc2e New ldb command to convert compaction style
Summary:
Add new command "change_compaction_style" to ldb tool. For
universal->level, it shows "nothing to do". For level->universal, it
compacts all files into a single one and moves the file to level 0.

Also add check for number of files at level 1+ when opening db with
universal compaction style.

Test Plan:
'make all check'. New unit test for internal convertion function. Also manully test various
cmd like:

./ldb change_compaction_style --old_compaction_style=0
--new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: vamsi, emayanke

Differential Revision: https://reviews.facebook.net/D12603
2013-09-04 13:13:08 -07:00
Mayank Agarwal
c34271a5a5 Fix bug in Counters and record Sequencenumber using only TickerCount
Summary:
The way counters/statistics are implemented in rocksdb demands that enum Tickers and TickerNameMap follow the same order, otherwise statistics exposed from fbcode/rocks get out-of-sync. 2 counters for prefix had violated this order and when I built counters for fbcode/mcrocksdb, statistics for sequence number were appearing out-of-sync.
The other change is to record sequence-number using setTickerCount only and not recordTick. This is because of difference in statistics as understood by rocks/utils which uses ServiceData::statistics function and rocksdb statistics. In rocksdb there is just 1 counter for a countername. But in ServiceData there are 4 independent buckets for every countername-Count, Sum, Average and Rate. SetTickerCount and RecordTick update the same variable in rocksdb but different buckets in ServiceData. Therefore, I had to choose one consistent function from RecordTick or SetTickerCount for sequence number in rocksdb. I chose SetTickerCount because the statistics object in options passed during rocksdb-open is user-dependent and SetTickerCount makes sense there.
There will be a corresponding diff to mcorcksdb in fbcode shortly.

Test Plan: make all check; check ticker value using fprintfs

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12669
2013-09-01 17:59:32 -07:00
Mayank Agarwal
ab5c5c28fe Fix build caused by DeleteFile not tolerating / at the beginning
Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName.

Test Plan: make;./delete_filetest;make all check

Reviewers: dhruba, haobo, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12663
2013-09-01 17:59:13 -07:00
Dhruba Borthakur
59de2dbad7 Cleanup DeleteFile API
Summary:
The DeleteFile API was removing files inside the db-lock. This
is now changed to remove files outside the db-lock.
The GetLiveFilesMetadata() returns the smallest and largest
seqnuence number of each file as well.

Test Plan: deletefile_test

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Maniphest Tasks: T63

Differential Revision: https://reviews.facebook.net/D12567
2013-08-28 21:18:58 -07:00
Haobo Xu
48e5ea0c34 [RocksDB] Fix TransformRepFactory related valgrind problem
Summary: Let TransformRepFactory own the passed in transform. Also make it better encapsulated.

Test Plan: make valgrind_check;

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12591
2013-08-28 19:27:54 -07:00
Dhruba Borthakur
fc0c399d2e Introduced a new flag non_blocking_io in ReadOptions.
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().

Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Maniphest Tasks: T63

Differential Revision: https://reviews.facebook.net/D12531
2013-08-28 10:49:14 -07:00
Haobo Xu
43eef52001 [RocksDB] move stats counting outside of mutex protected region for DB::Get()
Summary:
As title. This is possible as tickers are atomic now.
db_bench on high qps in-memory muti-thread random get workload, showed ~5% throughput improvement.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12555
2013-08-27 13:36:10 -07:00
Deon Nicholas
573844807c Fix for no_io
Summary: Oops. My bad.

Test Plan: Make all check

Reviewers: emayanke

Reviewed By: emayanke

CC: haobo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D12525
2013-08-23 16:36:01 -07:00
Dhruba Borthakur
1186192ed1 Replace include/leveldb with include/rocksdb.
Summary: Replace include/leveldb with include/rocksdb.

Test Plan:
make clean; make check
make clean; make release

Differential Revision: https://reviews.facebook.net/D12489
2013-08-23 10:51:00 -07:00
Jim Paton
74781a0c49 Add three new MemTableRep's
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.

UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.

VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.

PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.

I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).

Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12117
2013-08-22 23:10:02 -07:00
Xing Jin
17dc128048 Pull from https://reviews.facebook.net/D10917
Summary: Pull Mark's patch and slightly revise it. I revised another place in db_impl.cc with similar new formula.

Test Plan:
make all check. Also run "time ./db_bench --num=2500000000 --numdistinct=2200000000". It has run for 20+ hours and hasn't finished. Looks good so far:

Installed stack trace handler for SIGILL SIGSEGV SIGBUS SIGABRT
LevelDB:    version 2.0
Date:       Tue Aug 20 23:11:55 2013
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    2500000000
RawSize:    276565.6 MB (estimated)
FileSize:   157356.3 MB (estimated)
Write rate limit: 0
Compression: snappy
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/leveldbtest-3088/dbbench]
fillseq      :    7202.000 micros/op 138 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
fillsync     :    7148.000 micros/op 139 ops/sec; (2500000 ops)
DB path: [/tmp/leveldbtest-3088/dbbench]
fillrandom   :    7105.000 micros/op 140 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
overwrite    :    6930.000 micros/op 144 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
readrandom   :       1.020 micros/op 980507 ops/sec; (0 of 2500000000 found)
DB path: [/tmp/leveldbtest-3088/dbbench]
readrandom   :       1.021 micros/op 979620 ops/sec; (0 of 2500000000 found)
DB path: [/tmp/leveldbtest-3088/dbbench]
readseq      :     113.000 micros/op 8849 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
readreverse  :     102.000 micros/op 9803 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
Created bg thread 0x7f0ac17f7700
compact      :  111701.000 micros/op 8 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
readrandom   :       1.020 micros/op 980376 ops/sec; (0 of 2500000000 found)
DB path: [/tmp/leveldbtest-3088/dbbench]
readseq      :     120.000 micros/op 8333 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
readreverse  :      29.000 micros/op 34482 ops/sec;
DB path: [/tmp/leveldbtest-3088/dbbench]
... finished 618100000 ops

Reviewers: MarkCallaghan, haobo, dhruba, chip

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D12441
2013-08-22 22:37:13 -07:00
Simha Venkataramaiah
60bf2b7d4a Add APIs to query SST file metadata and to delete specific SST files
Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures.

Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api.

I have used std::unique_ptr - will switch to boost:: since this is external. thoughts?

Unit test is fragile right now as it expects the compaction at certain levels.

Test Plan: unittest

Reviewers: dhruba, vamsi, emayanke

CC: zshao, leveldb, haobo

Task ID: #

Blame Rev:
2013-08-22 15:27:19 -07:00
Deon Nicholas
b87dcae1a3 Made merge_oprator a shared_ptr; and added TTL unit tests
Test Plan:
- make all check;
- make release;
- make stringappend_test; ./stringappend_test

Reviewers: haobo, emayanke

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D12381
2013-08-20 13:35:28 -07:00
Mayank Agarwal
8a3547d38e API for getting archived log files
Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly

Test Plan: make all check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12087
2013-08-19 13:37:04 -07:00
Mayank Agarwal
387ac0f1e1 Expose statistic for sequence number and implement setTickerCount
Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount

Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12327
2013-08-15 23:00:20 -07:00
Xing Jin
0a5afd1afc Minor fix to current codes
Summary:
Minor fix to current codes, including: coding style, output format,
comments. No major logic change. There are only 2 real changes, please see my inline comments.

Test Plan: make all check

Reviewers: haobo, dhruba, emayanke

Differential Revision: https://reviews.facebook.net/D12297
2013-08-14 23:03:57 -07:00
Mayank Agarwal
f1bf169484 Counter for merge failure
Summary:
With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail.
Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc

I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me.

Test Plan: make all check

Reviewers: dnicholas, haobo, dhruba, vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12129
2013-08-13 14:25:42 -07:00
Tyler Harter
f5f1842282 Prefix filters for scans (v4)
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3.  Also, make the CreateFilter code faster and cleaner.

Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test

Reviewers: dhruba

Reviewed By: dhruba

CC: haobo, emayanke

Differential Revision: https://reviews.facebook.net/D12027
2013-08-13 14:04:56 -07:00
sumeet
3b81df34bd Separate compaction filter for each compaction
Summary:
If we have same compaction filter for each compaction,
application cannot know about the different compaction processes.
Later on, we can put in more details in compaction filter for the
application to consume and use it according to its needs. For e.g. In
the universal compaction, we have a compaction process involving all the
files while others don't involve all the files. Applications may want to
collect some stats only when during full compaction.

Test Plan: run existing unit tests

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: xinyaohu, leveldb

Differential Revision: https://reviews.facebook.net/D12057
2013-08-13 10:56:20 -07:00
Dhruba Borthakur
93d77a27d2 Universal Compaction should keep DeleteMarkers unless it is the earliest file.
Summary:
The pre-existing code was purging a DeleteMarker if thay key did not
exist in deeper levels.  But in the Universal Compaction Style, all
files are in Level0. For compaction runs that did not include the
earliest file, we were erroneously purging the DeleteMarkers.

The fix is to purge DeleteMarkers only if the compaction includes
the earlist file.

Test Plan: DBTest.Randomized triggers this code path.

Differential Revision: https://reviews.facebook.net/D12081
2013-08-09 14:03:57 -07:00
Xing Jin
8ae905ed63 Fix unit tests for universal compaction (step 2)
Summary:
Continue fixing existing unit tests for universal compaction. I have
tried to apply universal compaction to all unit tests those haven't
called ChangeOptions(). I left a few which are either apparently not
applicable to universal compaction (because they check files/keys/values
at level 1 or above levels), or apparently not related to compaction
(e.g., open a file, open a db).

I also add a new unit test for universal compaction.

Good news is I didn't see any bugs during this round.

Test Plan: Ran "make all check" yesterday. Has rebased and is rerunning

Reviewers: haobo, dhruba

Differential Revision: https://reviews.facebook.net/D12135
2013-08-09 13:35:44 -07:00
Xing Jin
17b8f786a3 Fix unit tests/bugs for universal compaction (first step)
Summary:
This is the first step to fix unit tests and bugs for universal
compactiion. I added universal compaction option to ChangeOptions(), and
fixed all unit tests calling ChangeOptions(). Some of these tests
obviously assume more than 1 level and check file number/values in level
1 or above levels. I set kSkipUniversalCompaction for these tests.

The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
it.

I have also set universal compaction as the default compaction and found
at least 20+ unit tests failing. I haven't looked into the details. The
next step is to check all unit tests without calling ChangeOptions().

Test Plan: make all check

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D12051
2013-08-07 14:05:44 -07:00
Dhruba Borthakur
f5fa26b6a9 Merge branch 'performance' of github.com:facebook/rocksdb into performance
Conflicts:
	db/builder.cc
	db/db_impl.cc
	db/version_set.cc
	include/leveldb/statistics.h
2013-08-07 11:58:06 -07:00
Deon Nicholas
c2d7826ced [RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences.
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.

I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.

This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.

NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!

Test Plan:
  -This is a preliminary diff. I have only just begun testing/debugging it.
  -I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
  -I will be "desk-checking" and walking through the code with the help gdb.
  -I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
  -I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
  -A lot of feedback from the reviewers.

Reviewers: haobo, dhruba, zshao, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11499
2013-08-05 20:14:32 -07:00
Jim Paton
8e792e5896 Add soft_rate_limit stats
Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit.

Test Plan: make -j32 check

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12021
2013-08-05 18:45:23 -07:00
Jim Paton
1036537c94 Add soft and hard rate limit support
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:

1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.

Test Plan:
make -j32 check
./db_stress

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12003
2013-08-05 15:43:49 -07:00
Dhruba Borthakur
711a30cb30 Merge branch 'master' into performance
Conflicts:
	include/leveldb/options.h
	include/leveldb/statistics.h
	util/options.cc
2013-08-02 10:22:08 -07:00
Mayank Agarwal
59d0b02f8b Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache
Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test

Test Plan: make all check;db_stress for 1 hour

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11853
2013-08-01 09:07:46 -07:00
Jim Paton
9700677a2b Slow down writes gradually rather than suddenly
Summary:
Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger).

This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files.

For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384.

Test Plan:
make -j32 check
./db_stress
./db_bench

Reviewers: dhruba, haobo, MarkCallaghan, xjin

Reviewed By: xjin

CC: leveldb, xjin, zshao

Differential Revision: https://reviews.facebook.net/D11859
2013-07-31 16:20:48 -07:00
Xing Jin
0f0a24e298 Make arena block size configurable
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.

I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.

Test Plan:
new unit test, I am running db_test.

For passing paramter from configured option to Arena, I tried tests like:

  TEST(DBTest, Arena_Option) {
  std::string dbname = test::TmpDir() + "/db_arena_option_test";
  DestroyDB(dbname, Options());

  DB* db = nullptr;
  Options opts;
  opts.create_if_missing = true;
  opts.arena_block_size = 1000000; // tested 99, 999999
  Status s = DB::Open(opts, dbname, &db);
  db->Put(WriteOptions(), "a", "123");
  }

and printed some debug info. The results look good. Any suggestion for such a unit-test?

Reviewers: haobo, dhruba, emayanke, jpaton

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D11799
2013-07-31 12:42:23 -07:00
Jim Paton
6db52b525a Don't use redundant Env::NowMicros() calls
Summary: After my patch for stall histograms, there are redundant calls to NowMicros() by both the stop watches and DBImpl::MakeRoomForWrites. So I removed the redundant calls such that the information is gotten from the stopwatch.

Test Plan:
make clean
make -j32 check

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11883
2013-07-29 15:46:36 -07:00
Jim Paton
18afff2e63 Add stall counts to statistics
Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur.

Test Plan:
make -j32 check
./db_stress

# Not really sure what else should be done...

Reviewers: dhruba, MarkCallaghan, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11841
2013-07-29 10:34:23 -07:00
Dhruba Borthakur
d7ba5bce37 Revert 6fbe4e981a: If disable wal is set, then batch commits are avoided
Summary:
Revert "If disable wal is set, then batch commits are avoided" because
keeping the mutex while inserting into the skiplist means that readers
and writes are all serialized on the mutex.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-07-24 10:01:13 -07:00
Jim Paton
52d7ecfc78 Virtualize SkipList Interface
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.

Test Plan:
make clean
make -j32 check
./db_stress

Reviewers: dhruba, emayanke, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11739
2013-07-23 14:42:27 -07:00
Dhruba Borthakur
6fbe4e981a If disable wal is set, then batch commits are avoided.
Summary:
rocksdb uses batch commit to write to transaction log. But if
disable wal is set, then writes to transaction log are anyways
avoided. In this case, there is not much value-add to batch things,
batching can cause unnecessary delays to Puts().
This patch avoids batching when disableWal is set.

Test Plan:
make check.

I am running db_stress now.

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11763
2013-07-23 14:22:57 -07:00
Mayank Agarwal
bf66c10b13 Use KeyMayExist for WriteBatch-Deletes
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.

Test Plan: db_stress;make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D11745
2013-07-23 13:36:50 -07:00
Haobo Xu
d364eea1fc [RocksDB] Fix FindMinimumEmptyLevelFitting
Summary: as title

Test Plan: make check;

Reviewers: xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11751
2013-07-22 12:31:43 -07:00
Haobo Xu
9ee68871dc [RocksDB] Enable manual compaction to move files back to an appropriate level.
Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility.

Test Plan: make check;

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11553
2013-07-19 16:20:36 -07:00
Dhruba Borthakur
4a745a5666 Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	include/leveldb/options.h
	util/options.cc
2013-07-17 15:05:57 -07:00
Mayank Agarwal
2a986919d6 Make rocksdb-deletes faster using bloom filter
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time

Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved

Reviewers: dhruba, haobo, vamsi

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11607
2013-07-11 12:11:11 -07:00
Haobo Xu
9ba82786ce [RocksDB] Provide contiguous sequence number even in case of write failure
Summary: Replication logic would be simplifeid if we can guarantee that write sequence number is always contiguous, even if write failure occurs. Dhruba and I looked at the sequence number generation part of the code. It seems fixable. Note that if WAL was successful and insert into memtable was not, we would be in an unfortunate state. The approach in this diff is : IO error is expected and error status will be returned to client, sequence number will not be advanced; In-mem error is not expected and we panic.

Test Plan: make check; db_stress

Reviewers: dhruba, sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11439
2013-07-08 15:31:09 -07:00
Dhruba Borthakur
116ec527f2 Renamed 'hybrid_compaction' tp be "Universal Compaction'.
Summary:
All the universal compaction parameters are encapsulated in
a new file universal_compaction.h

Test Plan:
make check
2013-07-03 15:47:53 -07:00
Dhruba Borthakur
47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
Abhishek Kona
5ef6bb8c37 [rocksdb][refactor] statistic printing code to one place
Summary: $title

Test Plan: db_bench --statistics=1

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11373
2013-06-18 20:28:41 -07:00
Dhruba Borthakur
6acbe0fc45 Compact multiple memtables before flushing to storage.
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.

There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11241
2013-06-18 14:28:04 -07:00
Haobo Xu
1afdf28701 [RocksDB] Compaction Filter Cleanup
Summary: This hopefully gives the right semantics to compaction filter. Will write a small wiki to explain the ideas.

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11121
2013-06-14 14:23:08 -07:00
Deon Nicholas
4985a9f73b [Rocksdb] [Multiget] Introduced multiget into db_bench
Summary:
Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
flags for db_bench. Also updated and tested the ReadRandom() method
to include an option to use multiget. By default,
keys_per_multiget=100.

Preliminary tests imply that multiget is at least 1.25x faster per
key than regular get.

Will continue adding Multiget for ReadMissing, ReadHot,
RandomWithVerify, ReadRandomWriteRandom; soon. Will also think
about ways to better verify benchmarks.

Test Plan:
1. make db_bench
2. ./db_bench --benchmarks=fillrandom
3. ./db_bench --benchmarks=readrandom --use_existing_db=1
	      --use_multiget=1 --threads=4 --keys_per_multiget=100
4. ./db_bench --benchmarks=readrandom --use_existing_db=1
	      --threads=4
5. Verify ops/sec (and 1000000 of 1000000 keys found)

Reviewers: haobo, MarkCallaghan, dhruba

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11127
2013-06-12 12:42:21 -07:00
Haobo Xu
bdf1085944 [RocksDB] cleanup EnvOptions
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11175
2013-06-12 11:17:19 -07:00
Mayank Agarwal
184343a061 Max_mem_compaction_level can have maximum value of num_levels-1
Summary:
Without this files could be written out to a level greater than the maximum level possible and is the source of the segfaults that wormhole awas getting. The sequence of steps that was followed:
1. WriteLevel0Table was called when memtable was to be flushed for a file.
2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct.

Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults and found that the same segfault occurs without this change and not with this.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11157
2013-06-09 10:38:55 -07:00
Deon Nicholas
d8c7c45ea0 Very basic Multiget and simple test cases.
Summary:
Implemented the MultiGet operator which takes in a list of keys
and returns their associated values. Currently uses std::vector as its
container data structure. Otherwise, it works identically to "Get".

Test Plan:
 1. make db_test      ; compile it
 2. ./db_test         ; test it
 3. make all check    ; regress / run all tests
 4. make release      ; (optional) compile with release settings

Reviewers: haobo, MarkCallaghan, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10875
2013-06-05 11:22:38 -07:00
Abhishek Kona
d91b42ee27 [Rocksdb] Measure all FSYNC/SYNC times
Summary: Add stop watches around all sync calls.

Test Plan: db_bench check if respective histograms are printed

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11073
2013-06-05 11:06:21 -07:00
Mark Callaghan
d9f538e1a9 Improve output for GetProperty('leveldb.stats')
Summary:
Display separate values for read, write & total compaction IO.
Display compaction amplification and write amplification.
Add similar values for the period since the last call to GetProperty. Results since the server started
are reported as "cumulative" stats. Results since the last call to GetProperty are reported as
"interval" stats.

Level  Files Size(MB) Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7       13        21         0       211         0         0       211     0.0       0.0        10.1        0        0        0        0      113       0.0
  1       79      157        88       993       989       198       795       194     9.0      11.3        11.2      106      405      502       97       14       0.0
  2       19       36         5        63        63        37        27        36     2.4      12.3        12.2       19       14       32       18       12       0.0
>>>>>>>>>>>>>>>>>>>>>>>>> text below has been is new and/or reformatted
Uptime(secs): 122.2 total, 0.9 interval
Compaction IO cumulative (GB): 0.21 new, 1.03 read, 1.23 write, 2.26 read+write
Compaction IO cumulative (MB/sec): 1.7 new, 8.6 read, 10.3 write, 19.0 read+write
Amplification cumulative: 6.0 write, 11.0 compaction
Compaction IO interval (MB): 5.59 new, 0.00 read, 5.59 write, 5.59 read+write
Compaction IO interval (MB/sec): 6.5 new, 0.0 read, 6.5 write, 6.5 read+write
Amplification interval: 1.0 write, 1.0 compaction
>>>>>>>>>>>>>>>>>>>>>>>> text above is new and/or reformatted
Stalls(secs): 90.574 level0_slowdown, 0.000 level0_numfiles, 10.165 memtable_compaction, 0.000 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11049
2013-06-03 20:24:49 -07:00
Haobo Xu
2b1fb5b01d [RocksDB] Add score column to leveldb.stats
Summary: Added the 'score' column to the compaction stats output, which shows the level total size devided by level target size. Could be useful when monitoring compaction decisions...

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D11025
2013-06-03 17:38:27 -07:00
Haobo Xu
d897d33bf1 [RocksDB] Introduce Fast Mutex option
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D11031
2013-06-01 23:11:34 -07:00
Haobo Xu
2df65c118c [RocksDB] Dump counters and histogram data periodically with compaction stats
Summary: As title

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10995
2013-05-29 12:00:18 -07:00
heyongqiang
4b29651206 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 15:52:49 -07:00
Haobo Xu
ef15b9d178 [RocksDB] Fix MaybeDumpStats
Summary: MaybeDumpStats was causing lock problem

Test Plan: make check; db_stress

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10935
2013-05-24 15:43:16 -07:00
Haobo Xu
0e879c93de [RocksDB] dump leveldb.stats periodically in LOG file.
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).

Test Plan: make check;

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10761
2013-05-23 16:56:59 -07:00
Dhruba Borthakur
26541860d3 The max size of the write buffer size can be 64 GB.
Summary: There was an artifical limit on the size of the write buffer size.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10911
2013-05-23 15:00:27 -07:00
Haobo Xu
87d0af15d8 [RocksDB] Introduce an option to skip log error on recovery
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.

Test Plan: make check; db_stress

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10869
2013-05-21 14:30:36 -07:00
Abhishek Kona
e1174306c5 [RocksDB] Simplify StopWatch implementation
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.

Test Plan: make all check, db_bench with --statistics=1

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10809
2013-05-17 10:55:34 -07:00
Haobo Xu
4ca3c67bd3 [RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer.
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10773
2013-05-13 14:06:10 -07:00
Haobo Xu
73c0a33346 [RocksDB] fix compaction filter trigger condition
Summary:
Currently, compaction filter is run on internal key older than the oldest snapshot, which is incorrect.
Compaction filter should really be run on the most recent internal key when there is no external snapshot.

Test Plan: make check; db_stress

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10641
2013-05-13 12:33:02 -07:00
Abhishek Kona
8d58ecdc29 [RocksDB] Expose compaction stalls via db_statistics
Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10575
2013-05-10 14:41:45 -07:00
Abhishek Kona
988c20b9f7 [RocksDB] Clear Archive WAL files
Summary:
WAL files are moved to archive directory and clear only at DB::Open.
Can lead to a lot of space consumption in a Database. Added logic to periodically clear Archive Directory too.

Test Plan: make all check + add unit test

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10617
2013-05-06 11:41:01 -07:00
Haobo Xu
05e8854085 [Rocksdb] Support Merge operation in rocksdb
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.

Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h

Please go over local/my_test.cc carefully, as it is a concerete use case.

Please also review the impelmentation files to see if the straw man implementation makes sense.

Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.

Future work:
- Integration with compaction
- A raw Get implementation

I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.

Test Plan: run all local tests

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9651
2013-05-03 16:59:02 -07:00
Haobo Xu
eb6d139666 [RocksDB] Move table.h to table/
Summary:
- don't see a point exposing table.h to the public.
- fixed make clean to remove also *.d files.

Test Plan: make check; db_stress

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10479
2013-04-22 16:07:56 -07:00
Haobo Xu
b4243e5a3d [RocksDB] CompactionFilter cleanup
Summary:
- removed the compaction_filter_value from the callback interface. Restrict compaction filter to purging values.
- modify some comments to reflect curent status.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10335
2013-04-20 10:26:51 -07:00
Abhishek Kona
7c6c3c0ff4 [Rockdsdb] Better Error messages. Closing db instead of deleting db
Summary: A better error message. A local change. Did not look at other places where this could be done.

Test Plan: compile

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10251
2013-04-15 15:27:15 -07:00
Haobo Xu
013e9ebbf1 [RocksDB] [Performance] Speed up FindObsoleteFiles
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
   You also don't know how many dynamic allocations occur just for building up this tree.
   switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
   mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.

Test Plan: make check;  db_stress

Reviewers: dhruba, heyongqiang, MarkCallaghan

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10197
2013-04-12 11:29:27 -07:00
Mayank Agarwal
94d86b25a9 Fix memory leak for probableWALfiles in db_impl.cc
Summary: using unique_ptr to have automatic delete for probableWALfiles in db_impl.cc

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10083
2013-04-11 16:57:15 -07:00
Dhruba Borthakur
7730587120 Prevent segfault in OpenCompactionOutputFile
Summary:
The segfault was happening because the program was unable to open a new
sst file (as part of the compaction) because the process ran out of
file descriptors.

The fix is to check the return status of the file creation before taking
any other action.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fabf03f9700 (LWP 29904)]
leveldb::DBImpl::OpenCompactionOutputFile (this=this@entry=0x7fabf9011400, compact=compact@entry=0x7fabf741a2b0) at db/db_impl.cc:1399
1399    db/db_impl.cc: No such file or directory.
(gdb) where

Test Plan: make check

Reviewers: MarkCallaghan, sheki

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10101
2013-04-10 09:59:48 -07:00
Abhishek Kona
574b76f710 [RocksDB][Bug] Look at all the files, not just the first file in TransactionLogIter as BatchWrites can leave it in Limbo
Summary:
Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the currentFile.
The solution is if the last seq no. of the current file is < RequestedSeqNo. Assume the first seqNo. of the next file has to satisfy the request.

Also major refactoring around the code. Moved opening the logreader to a seperate function, got rid of goto.

Test Plan: added a unit test for it.

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10029
2013-04-08 16:28:09 -07:00
Abhishek Kona
ca789a10cc [Rocksdb] Recover last updated sequence number from manifest also.
Summary:
During recovery, last_updated_manifest number was not set if there were no records in the Write-ahead log.
Now check for the recovered manifest also and set last_updated_manifest file to the max value.

Test Plan: unit test

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9891
2013-04-02 17:18:27 -07:00
Haobo Xu
6763110867 [RocksDB] Replace iterator based loop with range based loop for stl containers
Summary:
As title.
Code is shorter and cleaner
See https://our.dev.facebook.com/intern/tasks/?t=2233981

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D9789
2013-04-02 11:46:45 -07:00
Haobo Xu
645ff8f231 Let's get rid of delete as much as possible, here are some examples.
Summary:
If a class owns an object:
 - If the object can be null => use a unique_ptr. no delete
 - If the object can not be null => don't even need new, let alone delete
 - for runtime sized array => use vector, no delete.

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9783
2013-03-28 17:31:44 -07:00
Abhishek Kona
3b51605b8d [RocksDB] Fix binary search while finding probable wal files
Summary:
RocksDB does a binary search to look at the files which might contain the requested sequence number at the call GetUpdatesSince.
There was a bug in the binary search => when the file pointed by the middle index of bsearch was empty/corrupt it needst to resize the vector and update indexes.
This now fixes that.

Test Plan: existing unit tests pass.

Reviewers: heyongqiang, dhruba

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9777
2013-03-28 13:37:15 -07:00
Abhishek Kona
8e9c781ae5 [Rocksdb] Fix Crash on finding a db with no log files. Error out instead
Summary:
If the vector returned by GetUpdatesSince is empty, it is still returned to the
user. This causes it throw an std::range error.
The probable file list is checked and it returns an IOError status instead of OK now.

Test Plan: added a unit test.

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9771
2013-03-28 13:19:07 -07:00
Abhishek Kona
7fdd5f5b33 Use non-mmapd files for Write-Ahead Files
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.

There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
without this dif :
fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s

Test Plan: unit test included

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9741
2013-03-28 13:13:35 -07:00
Haobo Xu
ecd8db0200 [RocksDB] Minimize Mutex protected code section in the critical path
Summary: rocksdb uses a single global lock to protect in memory metadata. We should minimize the mutex protected code section to increase the effective parallelism of the program. See https://our.intern.facebook.com/intern/tasks/?t=2218928

Test Plan:
make check
db_bench

Reviewers: dhruba, heyongqiang

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D9705
2013-03-26 22:42:26 -07:00
Dhruba Borthakur
d0798f67f4 Run compactions even if workload is readonly or read-mostly.
Summary:
The events that trigger compaction:
* opening the database
* Get -> only if seek compaction is not disabled and other checks are true
* MakeRoomForWrite -> when memtable is full
* BackgroundCall ->
  If the background thread is about to do a compaction run, it schedules
  a new background task to trigger a possible compaction. This will cause
  additional background threads to find and process other compactions that
  can run concurrently.

Test Plan: ran db_bench with overwrite and readonly alternatively.

Reviewers: sheki, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9579
2013-03-20 23:43:29 -07:00
Dhruba Borthakur
ad96563b79 Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database.
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.

The default setting remains the same (and is backward compatible):
 1. use bufferedio
 2. do not use mmaps for reads
 3. use mmap for writes
 4. use readaheads for reads needed for compaction

I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.

Test Plan: make check

Reviewers: sheki, heyongqiang, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9429
2013-03-20 23:14:03 -07:00
Mayank Agarwal
487168cdcf Fixed sign-comparison in rocksdb code-base and fixed Makefile
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base

Test Plan: make

Reviewers: chip, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9531
2013-03-19 14:35:23 -07:00
Mark Callaghan
72d14eafd3 add --benchmarks=levelstats option to db_bench, prevent "nan" in stats output
Summary:
Add --benchmarks=levelstats option to report per-level stats (#files, #bytes)
Change readwhilewriting test to report response time for writes but exclude
them from the stats merged by all threads.
Prevent "NaN" in stats output by preventing division by 0.
Remove "o" file I committed by mistake.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9513
2013-03-19 13:14:44 -07:00
Abhishek Kona
02c459805b Ignore a zero-sized file while looking for a seq-no in GetUpdatesSince
Summary:
Rocksdb can create 0 sized log files when it is opened and closed without any operations.
The GetUpdatesSince fails currently if there is a log file of size zero.

This diff fixes this. If there is a log file is 0, it is removed form the probable_file_list

Test Plan: unit test

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9507
2013-03-19 11:00:09 -07:00
Dhruba Borthakur
6d812b6afb A mechanism to detect manifest file write errors and put db in readonly mode.
Summary:
If there is an error while writing an edit to the manifest file, the manifest
file is closed and reopened to check if the edit made it in. However, if the
re-opening of the manifest is unsuccessful and options.paranoid_checks is set
t true, then the db refuses to accept new puts, effectively putting the db
in readonly mode.

In a future diff, I would like to make the default value of paranoid_check
to true.

Test Plan: make check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9201
2013-03-07 09:45:49 -08:00
Abhishek Kona
d68880a1b9 Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file
Summary:
Store the last flushed, seq no. in db_impl. Check against it in
transaction Log iterator. Do not attempt to read ahead if we do not know
if the data is flushed completely.
Does not work if flush is disabled. Any ideas on fixing that?
* Minor change, iter->Next is called the first time automatically for
* the first time.

Test Plan:
existing test pass.
More ideas on testing this?
Planning to run some stress test.

Reviewers: dhruba, heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9087
2013-03-06 14:05:53 -08:00
Dhruba Borthakur
afed60938f Fox db_stress crash by copying keys before changing sequencenum to zero.
Summary:
The compaction process zeros out sequence numbers if the output is
part of the bottommost level.
The Slice is supposed to refer to an immutable data buffer. The
merger that implements the priority queue while reading kvs as
the input of a compaction run reies on this fact. The bug was that
were updating the sequence number of a record in-place and that was
causing suceeding invocations of the merger to return kvs in
arbitrary order of sequence numbers.
The fix is to copy the key to a local memory buffer before setting
its seqno to 0.

Test Plan:
Set Options.purge_redundant_kvs_while_flush = false and then run
db_stress --ops_per_thread=1000 --max_key=320

Reviewers: emayanke, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9147
2013-03-06 10:52:08 -08:00
Mark Callaghan
993543d1be Add rate_delay_limit_milliseconds
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.

From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
                      because compaction for compressed levels is much
                      slower. When Lx is uncompressed and Lx+1 is compressed
                      then files pile up at Lx because the (Lx,Lx+1)->Lx+1
                      compaction process is the first to be slowed by
                      compression.
* without compression -> at level 1

Task ID: #1832108

Blame Rev:

Test Plan:
run with real data, added test

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9045
2013-03-04 07:41:15 -08:00
Dhruba Borthakur
806e264350 Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0.
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.

This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occur only if the most recent snapshot is earlier
than the earliest record in the memtable.

Should we switch on this feature by default or should we keep this feature
turned off in the default settings?

Test Plan: Added test case to db_test.cc

Reviewers: sheki, vamsi, emayanke, heyongqiang

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8991
2013-03-04 00:01:47 -08:00
Dhruba Borthakur
e45c7a8444 Abilty to support upto a million .sst files in the database
Summary:
There was an artifical limit of 50K files per database. This is
insifficient if the database is 1 TB in size and each file is 2 MB.

Test Plan: make check

Reviewers: sheki, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8919
2013-02-26 16:27:51 -08:00
Abhishek Kona
959337ed5b Measure compaction time.
Summary: just record time consumed in compaction

Test Plan: compile

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8781
2013-02-22 11:38:40 -08:00
Abhishek Kona
ec77366e14 Counters for bytes written and read.
Summary:
* Counters for bytes read and write.
as a part of this diff, I want to=>
* Measure compaction times. @dhruba can you point which function, should
* I time to get Compaction-times. Was looking at CompactRange.

Test Plan: db_test

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8763
2013-02-21 16:06:32 -08:00
Abhishek Kona
fe10200ddc Introduce histogram in statistics.h
Summary:
* Introduce is histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.

Test Plan:
ran db_bench and check stats.
make all check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8637
2013-02-20 10:43:32 -08:00
amayank
f3901e0647 Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value"
This reverts commit 4c696ed001.
2013-02-18 22:32:27 -08:00
Dhruba Borthakur
4564915446 Zero out redundant sequence numbers for kvs to increase compression efficiency
Summary:
The sequence numbers in each record eat up plenty of space on storage.
The optimization zeroes out sequence numbers on kvs in the Lmax
layer that are earlier than the earliest snapshot.

Test Plan: Unit test attached.

Differential Revision: https://reviews.facebook.net/D8619
2013-02-18 21:51:15 -08:00
amayank
4c696ed001 Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
flush_on_destroy has a default value of false and the memtable is flushed
in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that
the destructor is called(db is closed) and the cases where we work with the memtable only are very less
it is a good idea to give this a default value of true. Thus the put from ldb
wil have its data flushed to disk in the destructor and the next Get will be able to
read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that the Get from normal Open first reads
the memtable and directly finds the latest value written there and the Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled

Test Plan: make all; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: heyongqiang

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8631
2013-02-15 16:56:06 -08:00
Kai Liu
b63aafce42 Allow the logs to be purged by TTL.
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.

Test Plan:
* Added a unit tests for different types of "auto splitable" loggers individually.
* Tested the composited logger which allows the log files to be splitted by both TTL and log size.

Reviewers: heyongqiang, dhruba

Reviewed By: heyongqiang

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D8037
2013-02-04 19:42:40 -08:00
Chip Turner
0b83a83191 Fix poor error on num_levels mismatch and few other minor improvements
Summary:
Previously, if you opened a db with num_levels set lower than
the database, you received the unhelpful message "Corruption:
VersionEdit: new-file entry."  Now you get a more verbose message
describing the issue.

Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).

Lastly, unique_ptr'ify a couple of minor calls.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8151
2013-01-25 15:37:26 -08:00
Chip Turner
772f75b3fb Stop continually re-creating build_version.c
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is.  This makes
builds faster.

This also fixes an issue with 'make clean FOO' not working properly.

Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.

Test Plan: make, make clean

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D8139
2013-01-24 17:51:39 -08:00
Chip Turner
3dafdfb2c4 Use fallocate to prevent excessive allocation of sst files and logs
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space.  xfs in our production environment pre-allocates by
1GB, for instance.  By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wasteage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).

Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress.  LOG file without this change was 4M,
and with it it was 128k then grew to normal size.

Reviewers: dhruba

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D7953
2013-01-24 12:25:13 -08:00
Chip Turner
2fdf91a4f8 Fix a number of object lifetime/ownership issues
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.

Test Plan: db_stress, make check

Reviewers: dhruba

Reviewed By: dhruba

CC: zshao, leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D8043
2013-01-23 16:54:11 -08:00
Abhishek Kona
16903c35b0 Add counters to count gets and writes
Summary: Add Tickers to count Write's and Get's

Test Plan: make check

Reviewers: dhruba, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7977
2013-01-17 12:27:56 -08:00
Kosie van der Merwe
3c3df7402f Fixed issues Valgrind found.
Summary:
Found issues with `db_test` and `db_stress` when running valgrind.

`DBImpl` had an issue where if an compaction failed then it will use the uninitialised file size of an output file is used. This manifested as the final call to output to the log in `DoCompactionWork()` branching on uninitialized memory (all the way down in printf's innards).

Test Plan:
Ran `valgrind --track_origins=yes ./db_test` and `valgrind ./db_stress` to see if issues disappeared.

Ran `make check` to see if there were no regressions.

Reviewers: vamsi, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8001
2013-01-17 10:04:45 -08:00
Abhishek Kona
7d5a4383bb rollover manifest file.
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?

Test Plan: make all check. Inputs on a new unit test?

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7701
2013-01-16 12:09:44 -08:00
Chip Turner
c0cb289d57 Various build cleanups/improvements
Summary:
Specific changes:

1) Turn on -Werror so all warnings are errors
2) Fix some warnings the above now complains about
3) Add proper dependency support so changing a .h file forces a .c file
to rebuild
4) Automatically use fbcode gcc on any internal machine rather than
whatever system compiler is laying around
5) Fix jemalloc to once again be used in the builds (seemed like it
wasn't being?)
6) Fix issue where 'git' would fail in build_detect_version because of
LD_LIBRARY_PATH being set in the third-party build system

Test Plan:
make, make check, make clean, touch a header file, make sure
rebuild is expected

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7887
2013-01-14 18:40:22 -08:00
Kosie van der Merwe
d8371ef1f6 Fixing some issues Valgrind found
Summary: Found some issues running Valgrind on `db_test` (there are still some outstanding ones) and fixed them.

Test Plan:
make check

ran `valgrind ./db_test` and saw that errors no longer occur

Reviewers: dhruba, vamsi, emayanke, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7803
2013-01-08 12:16:40 -08:00
Kosie van der Merwe
d6e873f22f Added clearer error message for failure to create db directory in DBImpl::Recover()
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error.

Fixed CreateDirIfMissing() and added Env.DirExists()

Test Plan:
make check to test for regessions

Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb

After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb

Reviewers: dhruba, emayanke, vamsi, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7707
2013-01-07 10:11:18 -08:00
Dhruba Borthakur
f4c2b7cf97 Enhance ReadOnly mode to process the all committed transactions.
Summary:
Leveldb has an api OpenForReadOnly() that opens the database
in readonly mode. This call had an option to not process the
transaction log.  This patch removes this option and always
processes all transactions that had been committed. It has
been done in such a way that it does not create/write to
any new files in the process. The invariant of "no-writes"
to the leveldb data directory is still true.

This enhancement allows multiple threads to open the same database
in readonly mode and access all trancations that were committed right
upto the OpenForReadOnly call.

I changed the public API to match the new semantics because
there are no users who are currently using this api.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7479
2012-12-19 16:30:46 -08:00
Dhruba Borthakur
3d1e92b05a Enhancements to rocksdb for better support for replication.
Summary:
1. The OpenForReadOnly() call should not lock the db. This is useful
so that multiple processes can open the same database concurrently
for reading.
2. GetUpdatesSince should not error out if the archive directory
does not exist.
3. A new constructor for WriteBatch that can takes a serialized
string as a parameter of the constructor.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7449
2012-12-17 11:40:19 -08:00
Kosie van der Merwe
62d48571de Added meta-database support.
Summary:
Added kMetaDatabase for meta-databases in db/filename.h along with supporting
fuctions.
Fixed switch in DBImpl so that it also handles kMetaDatabase.
Fixed DestroyDB() that it can handle destroying meta-databases.

Test Plan: make check

Reviewers: sheki, emayanke, vamsi, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7245
2012-12-17 11:26:59 -08:00
Abhishek Kona
2f0585fb97 Fix a bug. Where DestroyDB deletes a non-existant archive directory.
Summary:
C tests would fail sometimes as DestroyDB would return a Failure Status
message when deleting an archival directory which was not created
(WAL_ttl_seconds = 0).

Fix: Ignore the Status returned on Deleting Archival Directory.

Test Plan: * make check

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7395
2012-12-17 10:25:26 -08:00
Abhishek Kona
22c283625b Fix Bug in Binary Search for files containing a seq no. and delete Archived Log Files during Destroy DB.
Summary:
* Fixed implementation bug in Binary_Searvch introduced in https://reviews.facebook.net/D7119
* Binary search is also overflow safe.
* Delete archive log files and archive dir during DestroyDB

Test Plan: make check

Reviewers: dhruba

CC: kosievdmerwe, emayanke

Differential Revision: https://reviews.facebook.net/D7263
2012-12-11 16:15:02 -08:00
Dhruba Borthakur
24fc379273 An public api to fetch the latest transaction id.
Summary:
Implement a interface to retrieve the most current transaction
id from the database.

Test Plan: Added unit test.

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7269
2012-12-10 16:04:19 -08:00
Abhishek Kona
1c6742e32f Refactor GetArchivalDirectoryName to filename.h
Summary:
filename.h has functions to do similar things.
Moving code away from db_impl.cc

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7251
2012-12-10 10:51:07 -08:00
Abhishek Kona
8055008909 GetUpdatesSince API to enable replication.
Summary:
How it works:
* GetUpdatesSince takes a SequenceNumber.
* A LogFile with the first SequenceNumber nearest and lesser than the requested Sequence Number is found.
* Seek in the logFile till the requested SeqNumber is found.
* Return an iterator which contains logic to return record's one by one.

Test Plan:
* Test case included to check the good code path.
* Will update with more test-cases.
* Feedback required on test-cases.

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7119
2012-12-07 11:42:13 -08:00
Dhruba Borthakur
c847a31727 Print compaction score for every compaction run.
Summary:
A compaction is picked based on its score. It is useful to
print the compaction score in the LOG because it aids in
debugging. If one looks at the logs, one can find out why
a compaction was preferred over another.

Test Plan: make clean check

Differential Revision: https://reviews.facebook.net/D7137
2012-12-04 10:03:47 -08:00
sheki
d4627e6de4 Move WAL files to archive directory, instead of deleting.
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.

Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6975
2012-11-28 17:28:08 -08:00
Abhishek Kona
d29f181923 Fix all the lint errors.
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.

Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.

Test Plan: make all check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7059
2012-11-28 17:18:41 -08:00
Dhruba Borthakur
9a357847eb Delete non-visible keys during a compaction even in the presense of snapshots.
Summary:
 LevelDB should delete almost-new keys when a long-open snapshot exists.
The previous behavior is to keep all versions that were created after the
oldest open snapshot. This can lead to database size bloat for
high-update workloads when there are long-open snapshots and long-open
snapshot will be used for logical backup. By "almost new" I mean that the
key was updated more than once after the oldest snapshot.

If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if
we find two instances of the same key k1 that lie entirely within s1 and
s2 (i.e. s1 < k1 < s2), then the earlier version
of k1 can be safely deleted because that version is not visible in any snapshot.

Test Plan:
unit test attached
make clean check

Differential Revision: https://reviews.facebook.net/D6999
2012-11-28 15:47:40 -08:00
Dhruba Borthakur
3366eda839 Print out status at the end of a compaction run.
Summary:
Print out status at the end of a compaction run. This helps in
debugging.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D7035
2012-11-27 22:17:38 -08:00
sheki
43f5a07989 Remove unused varibles. Cause compiler warnings.
Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: emayanke

Differential Revision: https://reviews.facebook.net/D6993
2012-11-26 20:55:24 -08:00
Dhruba Borthakur
fbb73a4ac3 Support to disable background compactions on a database.
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6849
2012-11-20 21:12:06 -08:00
Dhruba Borthakur
62e7583f94 enhance dbstress to simulate hard crash
Summary:
dbstress has an option to reopen the database. Make it such that the
previous handle is not closed before we reopen, this simulates a
situation similar to a process crash.

Added new api to DMImpl to remove the lock file.

Test Plan: run db_stress

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6777
2012-11-18 23:16:17 -08:00
Dhruba Borthakur
6c5a4d646a Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2012-11-14 21:39:52 -08:00
Dhruba Borthakur
5d16e503a6 Improved CompactionFilter api: pass in a opaque argument to CompactionFilter invocation.
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.

Test Plan: Enehanced unit test for opaque parameter to CompactionFilter.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6711
2012-11-13 16:22:26 -08:00
Abhishek Kona
0f8e4721a5 Metrics: record compaction drop's and bloom filter effectiveness
Summary: Record BloomFliter hits and drop off reasons during compaction.

Test Plan: Unit tests work.

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6591
2012-11-09 11:38:45 -08:00
heyongqiang
20d18a89a3 disable size compaction in ldb reduce_levels and added compression and file size parameter to it
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,

added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command

Test Plan: run ldb

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6597
2012-11-09 10:14:47 -08:00
Dhruba Borthakur
95dda37858 Move filesize-based-sorting to outside the Mutex
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.

Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.

At steady state, I have seen that this patch increase
throughout from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.

This is mostly an early preview and not ready for rigorous review.

With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.

Test Plan: make check

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D6411
2012-11-07 15:39:44 -08:00
Dhruba Borthakur
8143062edd Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/version_set.cc
	util/options.cc
2012-11-07 15:11:37 -08:00
heyongqiang
3fcf533ed0 Add a readonly db
Summary: as subject

Test Plan: run db_bench readrandom

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6495
2012-11-07 14:19:48 -08:00
Abhishek Kona
4e413df3d0 Flush Data at object destruction if disableWal is used.
Summary:
Added a conditional flush in ~DBImpl to flush.
There is still a chance of writes not being persisted if there is a
crash (not a clean shutdown) before the DBImpl instance is destroyed.

Test Plan: modified db_test to meet the new expectations.

Reviewers: dhruba, heyongqiang

Differential Revision: https://reviews.facebook.net/D6519
2012-11-06 15:04:42 -08:00
Dhruba Borthakur
aa42c66814 Fix all warnings generated by -Wall option to the compiler.
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.

Test Plan: make all check

Reviewers: heyongqiang, emayanke, sheki

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6525
2012-11-06 14:07:31 -08:00
Dhruba Borthakur
5f91868cee Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	util/options.cc
2012-11-05 16:51:55 -08:00
Dhruba Borthakur
5273c81483 Ability to invoke application hook for every key during compaction.
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.

A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.

Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.

I will be writing test cases for this api in mynext version of
this patch.

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

CC: sheki, adsharma

Differential Revision: https://reviews.facebook.net/D6285
2012-11-05 16:02:13 -08:00
Dhruba Borthakur
81f735d97c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	util/options.cc
2012-11-05 09:41:38 -08:00
heyongqiang
3096fa7534 Add two more options: disable block cache and make table cache shard number configuable
Summary:

as subject

Test Plan:

run db_bench and db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6111
2012-11-01 13:23:21 -07:00
Mark Callaghan
3e7e269292 Use timer to measure sleep rather than assume it is 1000 usecs
Summary:
This makes the stall timers in MakeRoomForWrite more accurate by timing
the sleeps. From looking at the logs the real sleep times are usually
about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83

Task ID: #

Blame Rev:

Test Plan:
run db_bench, look at DB/LOG

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6297
2012-10-30 07:21:37 -07:00
Dhruba Borthakur
53e04311b1 Merge branch 'master' into performance
Conflicts:
	db/db_bench.cc
	util/options.cc
2012-10-29 14:18:00 -07:00
Dhruba Borthakur
321dfdc3ae Allow having different compression algorithms on different levels.
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.

This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.

The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.

Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.

With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second

With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second

This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6261
2012-10-29 11:48:09 -07:00
Mark Callaghan
acc8567b24 Add more rates to db_bench output
Summary:
Adds the "MB/sec in" and "MB/sec out" to this line:
Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out

Changes all values to be reported per interval and since test start for this line:
... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6291
2012-10-29 11:30:07 -07:00
Mark Callaghan
70c42bf05f Adds DB::GetNextCompaction and then uses that for rate limiting db_bench
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6243
2012-10-29 10:17:43 -07:00
Kai Liu
d50f8eb603 Enable LevelDb to create a new log file if current log file is too large.
Summary: Enable LevelDb to create a new log file if current log file is too large.

Test Plan:
Write a script and manually check the generated info LOG.

Task ID: 1803577

Blame Rev:

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: zshao

Differential Revision: https://reviews.facebook.net/D6003
2012-10-26 14:55:02 -07:00
Mark Callaghan
65855dd8d4 Normalize compaction stats by time in compaction
Summary:
I used server uptime to compute per-level IO throughput rates. I
intended to use time spent doing compaction at that level. This fixes that.

Task ID: #

Blame Rev:

Test Plan:
run db_bench, look at results

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6237
2012-10-26 14:19:13 -07:00
Dhruba Borthakur
ea9e087851 Merge branch 'master' into performance
Conflicts:
	db/db_bench.cc
	db/db_impl.cc
	db/db_test.cc
2012-10-26 08:57:56 -07:00
Mark Callaghan
e7206f43ee Improve statistics
Summary:
This adds more statistics to be reported by GetProperty("leveldb.stats").
The new stats include time spent waiting on stalls in MakeRoomForWrite.
This also includes the total amplification rate where that is:
    (#bytes of sequential IO during compaction) / (#bytes from Put)
This also includes a lot more data for the per-level compaction report.
* Rn(MB) - MB read from level N during compaction between levels N and N+1
* Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1
* Wnew(MB) - new data written to the level during compaction
* Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB)
* Rn - files read from level N during compaction between levels N and N+1
* Rnp1 - files read from level N+1 during compaction between levels N and N+1
* Wnp1 - files written to level N+1 during compaction between levels N and N+1
* NewW - new files written to level N+1 during compaction
* Count - number of compactions done for this level

This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB)

                               Compactions
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)  Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s)   Rn Rnp1 Wnp1 NewW Count
-------------------------------------------------------------------------------------------------------------------------------------
  0        3        6        33        0       576       0        0      576    -1.0       0.0         1.3     0    0    0    0   290
  1      127      242       351     5316      5314     570     4747      567    17.0      12.1        12.1   287 2399 2685  286    32
  2      161      328        54      822       824     326      496      328     4.0       1.9         1.9   160  251  411  160   161
Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out
Uptime(secs): 439.8
Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
(cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8)

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D6153
2012-10-24 14:21:38 -07:00
Dhruba Borthakur
4c107587ed Delete files outside the mutex.
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6123
2012-10-22 11:53:23 -07:00
Dhruba Borthakur
3489cd615c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2012-10-21 02:15:19 -07:00
Dhruba Borthakur
f95219fb32 Delete files outside the mutex.
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6123
2012-10-21 02:03:00 -07:00
Dhruba Borthakur
98f23cf04a Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2012-10-21 01:55:19 -07:00
Dhruba Borthakur
64c4b9f0e2 Delete files outside the mutex.
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-10-21 01:49:48 -07:00
Dhruba Borthakur
1ca0584345 This is the mega-patch multi-threaded compaction
published in https://reviews.facebook.net/D5997.

Summary:
This patch allows compaction to occur in multiple background threads
concurrently.

If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctess
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.

The updates to the manifest are done via group-commit approach.

Test Plan: run db_bench
2012-10-19 14:00:53 -07:00
Dhruba Borthakur
aa73538f2a The deletion of obsolete files should not occur very frequently.
Summary:
The method DeleteObsolete files is a very costly methind, especially
when the number of files in a system is large. It makes a list of
all live-files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.

This patch makes it such that DeleteObsolete files is never
invoked twice within a configured period.

Test Plan: run all unit tests

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6045
2012-10-16 10:26:10 -07:00
Dhruba Borthakur
24eea931ef If ReadCompaction is switched off, then it is better to not even submit background compaction jobs.
Summary:
If ReadCompaction is switched off, then it is better to not even
submit background compaction jobs. I see about 3% increase in
read-throughput on a pure memory database.

Test Plan: run db_bench

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5673
2012-09-25 11:07:01 -07:00
Dhruba Borthakur
bb2dcd2457 Segfault in DoCompactionWork caused by buffer overflow
Summary:
The code was allocating 200 bytes on the stack but it
writes 256 bytes into the array.

x8a8ea5 std::_Rb_tree<>::erase()
    @     0x7f134bee7eb0 (unknown)
    @           0x8a8ea5 std::_Rb_tree<>::erase()
    @           0x8a35d6 leveldb::DBImpl::CleanupCompaction()
    @           0x8a7810 leveldb::DBImpl::BackgroundCompaction()
    @           0x8a804d leveldb::DBImpl::BackgroundCall()
    @           0x8c4eff leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper()
    @     0x7f134b3c010d start_thread
    @     0x7f134bf9f10d clone

Test Plan: run db_bench with overwrite option

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5595
2012-09-21 10:55:38 -07:00
Dhruba Borthakur
fb4b381a0c Print out the compile version in the LOG.
Summary: Print out the compile version in the LOG.

Test Plan: run dbbench and verify LOG

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5529
2012-09-18 13:24:32 -07:00
Dhruba Borthakur
ba55d77b5d Ability to take a file-lvel snapshot from leveldb.
Summary:
A set of apis that allows an application to backup data from the
leveldb database based on a set of files.

Test Plan: unint test attached. more coming soon.

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5439
2012-09-17 09:14:50 -07:00
Dhruba Borthakur
407727b75f Fix compiler warnings. Use uint64_t instead of uint.
Summary: Fix compiler warnings. Use uint64_t instead of uint.

Test Plan: build using -Wall

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5355
2012-09-12 14:42:36 -07:00
heyongqiang
0f43aa474e put log in a seperate dir
Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path.

Test Plan: db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D5205
2012-09-06 17:52:08 -07:00
Dhruba Borthakur
fe93631678 Clean up compiler warnings generated by -Wall option.
Summary:
Clean up compiler warnings generated by -Wall option.
make clean all OPT=-Wall

This is a pre-requisite before making a new release.

Test Plan: compile and run unit tests

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5019
2012-08-29 14:24:51 -07:00
heyongqiang
a4f9b8b49e merge 1.5
Summary:

as subject

Test Plan:

db_test table_test

Reviewers: dhruba
2012-08-28 11:43:33 -07:00
heyongqiang
6fee5a74f5 Do not spin in a tight loop attempting compactions if there is a compaction error
Summary: as subject. ported the change from google code leveldb 1.5

Test Plan: run db_test

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4839
2012-08-28 11:43:33 -07:00
heyongqiang
d3759ca121 fix db_test error with scribe logger turned on
Summary: as subject

Test Plan: db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D4929
2012-08-28 11:22:58 -07:00
Dhruba Borthakur
fc20273e73 Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
Summary:
Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
This is needed for data durability when running on ext3 filesystems.
Added options to the benchmark db_bench to generate performance numbers
with either fsync or fdatasync enabled.

Cleaned up Makefile to build leveldb_shell only when building the thrift
leveldb server.

Test Plan: build and run benchmark

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4911
2012-08-27 21:24:17 -07:00
heyongqiang
1c99b0a6b3 add more logs
Summary:

as subject

Test Plan:db_test

Reviewers: dhruba
2012-08-24 15:18:43 -07:00
Dhruba Borthakur
e5a7c8e580 Log the open-options to the LOG.
Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit.

Test Plan: num db_bench

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4833
2012-08-22 12:22:12 -07:00
Dhruba Borthakur
a098207c95 Fixed unit test c_test by initializing logger=NULL.
Summary:
Fixed unit test c_test by initializing logger=NULL.

Removed "atomic" from last_log_ts so that unit tests do not require C11 compiler.
Anyway, last_log_ts is mostly used for logging, so it is ok if it is loosely
accurate.

Test Plan: run c_test

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4803
2012-08-21 17:10:29 -07:00
Dhruba Borthakur
f4e7febf22 Record the version of the source repository that was used to build the leveldb library.
Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename | grep _build_" to find the version of the source that we used to build the executable file.

Test Plan: none

Differential Revision: https://reviews.facebook.net/D4785
2012-08-21 14:47:15 -07:00
heyongqiang
6ba1f17789 adding a scribe logger in leveldb to log leveldb deploy stats
Summary:
as subject.

A new log is written to scribe via thrift client when a new db is opened and when there is
a compaction.

a new option var scribe_log_db_stats is added.

Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats"

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4659
2012-08-21 11:43:22 -07:00
heyongqiang
680e571c4c add compaction log Summary:
Summary:
add compaction summary to log

log looks like:

2012/08/17-18:18:32.557334 7fdcaa2bb700 Compaction summary: Base level 0, input file:[11 9 7 ],[]

Test Plan: tested via db_test

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4749
2012-08-17 19:29:39 -07:00
heyongqiang
20ee76bd34 use ts as suffix for LOG.old files
Summary: as subject and only maintain 10 log files.

Test Plan: new test in db_test

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4731
2012-08-17 16:22:04 -07:00
Dhruba Borthakur
c3096afd61 Introduce a new option disableDataSync for opening the database. If this is set to true, then the data written to newly created data files are not sycned to disk, instead depend on the OS to flush dirty data to stable storage. This option is good for bulk
Test Plan:
manual tests

Task ID: #

Blame Rev:

Differential Revision: https://reviews.facebook.net/D4515
2012-08-03 15:23:53 -07:00
heyongqiang
22ee777f68 add flush interface to DB
Summary: as subject. The flush will flush everything in the db.

Test Plan: new test in db_test.cc

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D4029
2012-07-06 12:11:19 -07:00
heyongqiang
a347d4ac0d add disable WAL option
Summary: add disable WAL option

Test Plan: new testcase in db_test.cc

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D4011
2012-07-05 15:06:56 -07:00
heyongqiang
4e4b6812ff Make some variables configurable for each db instance
Summary:
Make configurable 'targetFileSize', 'targetFileSizeMultiplier',
'maxBytesForLevelBase', 'maxBytesForLevelMultiplier',
'expandedCompactionFactor', 'maxGrandParentOverlapFactor'

Test Plan: N/A

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3801
2012-06-27 14:36:31 -07:00
Dhruba Borthakur
338939e5c1 Print log message when we are throttling writes.
Summary:
Added option --writes=xxx to specify the number of keys that we want to overwrite in the benchmark.

Task ID: #

Blame Rev:

Test Plan: Revert Plan:

Reviewers: adsharma

CC: sc

Differential Revision: https://reviews.facebook.net/D3465
2012-06-01 14:03:37 -07:00
Sanjay Ghemawat
85584d497e Added bloom filter support.
In particular, we add a new FilterPolicy class.  An instance
of this class can be supplied in Options when opening a
database.  If supplied, the instance is used to generate
summaries of keys (e.g., a bloom filter) which are placed in
sstables.  These summaries are consulted by DB::Get() so we
can avoid reading sstable blocks that are guaranteed to not
contain the key we are looking for.

This change provides one implementation of FilterPolicy
based on bloom filters.

Other changes:
- Updated version number to 1.4.
- Some build tweaks.
- C binding for CompactRange.
- A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom.
- Minor .gitignore update.
2012-04-17 08:36:46 -07:00
Sanjay Ghemawat
583f1499c0 fix LOCK file deletion to prevent crash on windows 2012-03-09 07:51:04 -08:00
Sanjay Ghemawat
d79762e273 added group commit; drastically speeds up mult-threaded synchronous write workloads 2012-03-08 16:23:21 -08:00
Sanjay Ghemawat
3c8be108bf fixed issues 66 (leaking files on disk error) and 68 (no sync of CURRENT file) 2012-01-25 14:56:52 -08:00
Hans Wennborg
36a5f8ed7f A number of fixes:
- Replace raw slice comparison with a call to user comparator.
  Added test for custom comparators.

- Fix end of namespace comments.

- Fixed bug in picking inputs for a level-0 compaction.

  When finding overlapping files, the covered range may expand
  as files are added to the input set.  We now correctly expand
  the range when this happens instead of continuing to use the
  old range.  For example, suppose L0 contains files with the
  following ranges:

      F1: a .. d
      F2:    c .. g
      F3:       f .. j

  and the initial compaction target is F3.  We used to search
  for range f..j which yielded {F2,F3}.  However we now expand
  the range as soon as another file is added.  In this case,
  when F2 is added, we expand the range to c..j and restart the
  search.  That picks up file F1 as well.

  This change fixes a bug related to deleted keys showing up
  incorrectly after a compaction as described in Issue 44.

(Sync with upstream @25072954)
2011-10-31 17:22:06 +00:00
Gabor Cselle
299ccedfec A number of bugfixes:
- Added DB::CompactRange() method.

  Changed manual compaction code so it breaks up compactions of
  big ranges into smaller compactions.

  Changed the code that pushes the output of memtable compactions
  to higher levels to obey the grandparent constraint: i.e., we
  must never have a single file in level L that overlaps too
  much data in level L+1 (to avoid very expensive L-1 compactions).

  Added code to pretty-print internal keys.

- Fixed bug where we would not detect overlap with files in
  level-0 because we were incorrectly using binary search
  on an array of files with overlapping ranges.

  Added "leveldb.sstables" property that can be used to dump
  all of the sstables and ranges that make up the db state.

- Removing post_write_snapshot support.  Email to leveldb mailing
  list brought up no users, just confusion from one person about
  what it meant.

- Fixing static_cast char to unsigned on BIG_ENDIAN platforms.

  Fixes	Issue 35 and Issue 36.

- Comment clarification to address leveldb Issue 37.

- Change license in posix_logger.h to match other files.

- A build problem where uint32 was used instead of uint32_t.

Sync with upstream @24408625
2011-10-05 16:30:28 -07:00
gabor@google.com
7263023651 Bugfixes: for Get(), don't hold mutex while writing log.
- Fix bug in Get: when it triggers a compaction, it could sometimes
  mark the compaction with the wrong level (if there was a gap
  in the set of levels examined for the Get).

- Do not hold mutex while writing to the log file or to the
  MANIFEST file.

  Added a new benchmark that runs a writer thread concurrently with
  reader threads.

  Percentiles
  ------------------------------
  micros/op: avg  median 99   99.9  99.99  99.999 max
  ------------------------------------------------------
  before:    42   38     110  225   32000  42000  48000
  after:     24   20     55   65    130    1100   7000

- Fixed race in optimized Get.  It should have been using the
  pinned memtables, not the current memtables.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@50 62dab493-f737-651d-591e-8d6aee1b9529
2011-09-01 19:08:02 +00:00
gabor@google.com
e3584f9c28 Bugfix for issue 33; reduce lock contention in Get(), parallel benchmarks.
- Fix for issue 33 (non-null-terminated result from
  leveldb_property_value())

- Support for running multiple instances of a benchmark in parallel.

- Reduce lock contention on Get():
  (1) Do not hold the lock while searching memtables.
  (2) Shard block and table caches 16-ways.

  Benchmark for evaluating this change:
  $ db_bench --benchmarks=fillseq1,readrandom --threads=$n
  (fillseq1 is a small hack to make sure fillseq runs once regardless
  of number of threads specified on the command line).



git-svn-id: https://leveldb.googlecode.com/svn/trunk@49 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-22 21:08:51 +00:00
dgrogan@chromium.org
a05525d13b @23023120
git-svn-id: https://leveldb.googlecode.com/svn/trunk@47 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-06 00:19:37 +00:00
gabor@google.com
60bd8015f2 Speed up Snappy uncompression, new Logger interface.
- Removed one copy of an uncompressed block contents changing
  the signature of Snappy_Uncompress() so it uncompresses into a
  flat array instead of a std::string.
        
  Speeds up readrandom ~10%.

- Instead of a combination of Env/WritableFile, we now have a
  Logger interface that can be easily overridden applications
  that want to supply their own logging.

- Separated out the gcc and Sun Studio parts of atomic_pointer.h
  so we can use 'asm', 'volatile' keywords for Sun Studio.




git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-21 02:40:18 +00:00
gabor@google.com
6699c7ebe6 Small tweaks and bugfixes for Issue 18 and 19.
Slight tweak to the no-overlap optimization: only push to
level 2 to reduce the amount of wasted space when the same
small key range is being repeatedly overwritten.

Fix for Issue 18: Avoid failure on Windows by avoiding
deletion of lock file until the end of DestroyDB().

Fix for Issue 19: Disregard sequence numbers when checking for 
overlap in sstable ranges. This fixes issue 19: when writing 
the same key over and over again, we would generate a sequence 
of sstables that were never merged together since their sequence
numbers were disjoint.

Don't ignore map/unmap error checks.

Miscellaneous fixes for small problems Sanjay found while diagnosing
issue/9 and issue/16 (corruption_testr failures).
- log::Reader reports the record type when it finds an unexpected type.
- log::Reader no longer reports an error when it encounters an expected
  zero record regardless of the setting of the "checksum" flag.
- Added a missing forward declaration.
- Documented a side-effects of larger write buffer sizes
  (longer recovery time).



git-svn-id: https://leveldb.googlecode.com/svn/trunk@37 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-15 00:20:57 +00:00
gabor@google.com
ccf0fcd5c2 A number of smaller fixes and performance improvements:
- Implemented Get() directly instead of building on top of a full
  merging iterator stack.  This speeds up the "readrandom" benchmark
  by up to 15-30%.

- Fixed an opensource compilation problem.
  Added --db=<name> flag to control where the database is placed.

- Automatically compact a file when we have done enough
  overlapping seeks to that file.

- Fixed a performance bug where we would read from at least one
  file in a level even if none of the files overlapped the key
  being read.

- Makefile fix for Mac OSX installations that have XCode 4 without XCode 3.

- Unified the two occurrences of binary search in a file-list
  into one routine.

- Found and fixed a bug where we would unnecessarily search the
  last file when looking for a key larger than all data in the
  level.

- A fix to avoid the need for trivial move compactions and
  therefore gets rid of two out of five syncs in "fillseq".

- Removed the MANIFEST file write when switching to a new
  memtable/log-file for a 10-20% improvement on fill speed on ext4.

- Adding a SNAPPY setting in the Makefile for folks who have
  Snappy installed. Snappy compresses values and speeds up writes.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@32 62dab493-f737-651d-591e-8d6aee1b9529
2011-06-22 02:36:45 +00:00
hans@chromium.org
80e5b0d944 sync with upstream @21706995
Fixed race condition reported by Dave Smit (dizzyd@dizzyd,com)
on the leveldb mailing list.  We were not signalling
waiters after a trivial move from level-0.  The result was
that in some cases (hard to reproduce), a write would get
stuck forever waiting for the number of level-0 files to drop
below its hard limit.

The new code is simpler: there is just one condition variable
instead of two, and the condition variable is signalled after
every piece of background work finishes.  Also, all compaction
work (including for manual compactions) is done in the
background thread, and therefore we can remove the
"compacting_" variable.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@31 62dab493-f737-651d-591e-8d6aee1b9529
2011-06-07 14:40:26 +00:00
dgrogan@chromium.org
740d8b3d00 Update from upstream @21551990
* Patch LevelDB to build for OSX and iOS
* Fix race condition in memtable iterator deletion.
* Other small fixes.

git-svn-id: https://leveldb.googlecode.com/svn/trunk@29 62dab493-f737-651d-591e-8d6aee1b9529
2011-05-28 00:53:58 +00:00
dgrogan@chromium.org
da79909507 sync with upstream @ 21409451
Check the NEWS file for details of what changed.

git-svn-id: https://leveldb.googlecode.com/svn/trunk@28 62dab493-f737-651d-591e-8d6aee1b9529
2011-05-21 02:17:43 +00:00
dgrogan@chromium.org
ba6dac0e80 @20776309
* env_chromium.cc should not export symbols.
* Fix MSVC warnings.
* Removed large value support.
* Fix broken reference to documentation file

git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-20 22:48:11 +00:00
dgrogan@chromium.org
69c6d38342 reverting disastrous MOE commit, returning to r21
git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:11:15 +00:00
dgrogan@chromium.org
b743906eea Revision created by MOE tool push_codebase.
MOE_MIGRATION=


git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:01:25 +00:00
dgrogan@chromium.org
b409afe968 chmod a-x
git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-18 23:15:58 +00:00
dgrogan@chromium.org
f779e7a5d8 @20602303. Default file permission is now 755.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-12 19:38:58 +00:00
jorlow@chromium.org
4671a695fc Move include files into a leveldb subdir.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-30 18:35:40 +00:00
jorlow@chromium.org
e2da744e12 Upstream changes.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@16 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-28 20:43:44 +00:00
jorlow@chromium.org
8303bb1b33 Pull from upstream.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@14 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-22 23:24:02 +00:00
jorlow@chromium.org
13b72af77b More changes from upstream.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@12 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-22 18:32:49 +00:00
jorlow@chromium.org
f67e15e50f Initial checkin.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@2 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-18 22:37:00 +00:00