Commit Graph

1008 Commits

Author SHA1 Message Date
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
f9a25dda9f Fix wrong merge 2014-01-22 11:30:19 -08:00
Igor Canadi
92a022ad07 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
	db/version_set.cc
2014-01-22 10:59:07 -08:00
Igor Canadi
fb01755aa4 Unfriending classes
Summary:
In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends.

I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :)

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15267
2014-01-22 10:55:16 -08:00
Igor Canadi
6fe9b57748 Refactor Recover() code
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered

The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.

[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15237
2014-01-22 10:45:26 -08:00
kailiu
4036d58dc9 Fix a Statistics-related unit test faulure
Summary:
In my MacOS, the member variables are populated with random numbers after initialization.
This diff fixes it by fill these arrays with 0.

Test Plan: make && ./table_test

Reviewers: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15315
2014-01-21 18:02:55 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Mark Callaghan
4e8321bfea Boost access before mutex is unlocked
Summary:
This moves the use of versions_ to before the mutex is unlocked
to avoid a possible race.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15279
2014-01-17 21:32:23 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
Igor Canadi
0f4a75b710 Fix SIGSEGV in compaction picker
Summary:
The SIGSEGV was introduced by https://reviews.facebook.net/D15171

I also fixed ExpandWhileOverlapping() which returned the failure by setting it's own stack variable to nullptr (!). This bug is present in 2.6 release, so I guess ExpandWhileOverlapping never fails :)

Test Plan: `make check`. Also MarkCallaghan confirmed it fixed the SIGSEGV he reported.

Reviewers: MarkCallaghan, kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15261
2014-01-17 12:02:03 -08:00
Mark Callaghan
439e36db21 Fix SlowdownAmount
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall

With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.

With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown

Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15243
2014-01-17 10:15:30 -08:00
Kai Liu
e19bad9bbd Fix some "make format" issue
Summary:
* make sure when some pre-check fails, the script won't halt immediately.
* change fburl to google's short url.
* Fix a bug in this script: now it checks the uncommitted code only.

Test Plan: Ran the script under differnet environments.

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15231
2014-01-16 14:26:51 -08:00
Igor Canadi
6d6fb70960 Remove compaction pointers
Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15225
2014-01-16 14:06:53 -08:00
Igor Canadi
c699c84af4 CompactionPicker
Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.

To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.

In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.

This diff depends on D15171, D15183, D15189 and D15201

Test Plan: `make check`

Reviewers: kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15207
2014-01-16 13:03:52 -08:00
kailiu
eae1804f29 Remove the unnecessary use of shared_ptr
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213
2014-01-15 18:22:01 -08:00
Igor Canadi
787f11bb3b Move more functions from VersionSet to Version
Summary:
This moves functions:
* VersionSet::Finalize() -> Version::UpdateCompactionStats()
* VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()

The diff depends on D15189, D15183 and D15171

Test Plan: make check

Reviewers: kailiu, sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15201
2014-01-15 16:23:36 -08:00
Igor Canadi
615d1ea2f4 Moving Compaction class to separate header file
Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.

The diff depends on D15171 and D15183

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15189
2014-01-15 16:22:34 -08:00
Igor Canadi
2f4eda7890 Move functions from VersionSet to Version
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183
2014-01-15 16:18:04 -08:00
Igor Canadi
65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
kailiu
c8f16221ed Fix the return type of WriteBatch::Data().
Summary: Quick fix for https://reviews.facebook.net/D15123

Test Plan: Make check

Reviewers: sdong, vkrest

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15165
2014-01-14 20:24:48 -08:00
Siying Dong
9b51af5a17 [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.

Test Plan:
make all check
will run db_stress for a while too to make sure no problem.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14589

Conflicts:
	db/db_impl.cc
2014-01-14 17:41:44 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Igor Canadi
1ed2404f27 Wrong number of levels is Invalid argument now, not corruption 2014-01-14 15:54:11 -08:00
Igor Canadi
6291020284 Fix test 2014-01-14 15:41:30 -08:00
Igor Canadi
7f3e417f59 Fix memtable construction in tests 2014-01-14 15:36:12 -08:00
Igor Canadi
055e6df45b VersionEdit not to take NumLevels()
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.

Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.

This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15159
2014-01-14 15:27:09 -08:00
Igor Canadi
7d9f21cf23 BuildBatchGroup -- memcpy outside of lock
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.

Test Plan:
`make check`

I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D15063
2014-01-14 14:49:31 -08:00
kailiu
481c77e526 Move the compilation of the shared libraries to "make release"
Compiling the shared libraries took a long time. Thus to speed up the development speed, it still makes sense to be separated from regular compilation.
2014-01-14 13:54:33 -08:00
Naman Gupta
78ee22508b Merge branch 'master' of github.com:facebook/rocksdb into sanitizedOptions 2014-01-14 12:41:29 -08:00
Kai Liu
d702d8073e A script that automatically reformat affected lines
Summary:
Added a script that reformat only the affected lines in a given diff.

I planned to make that file as pre-commit hook but looks it's a little bit more difficult than I thought. Since I don't want to spend too much time on this task right now, I eventually added a "make command" to achieve this with a few additional key strokes.

Also make the clang-format solely inherited from Google's style -- there are still debates on some of the style issues, but we can address them later once we reach a consensus.

Test Plan: Did some ugly format change and ran "make format", all affected lines are formatted as expected.

Reviewers: igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15147
2014-01-14 12:21:24 -08:00
Naman Gupta
1d9bac4d7f Use sanitized options while opening db
Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture.

Test Plan: make check

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14103
2014-01-14 11:46:24 -08:00
Siying Dong
fbbf0d1456 Pre-calculate whether to slow down for too many level 0 files
Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141
2014-01-14 11:23:02 -08:00
Siying Dong
51dd21926c DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 10:53:16 -08:00
kailiu
ac2fe72832 Compile dynamic library by default
Summary:
Per request, some users need to use dynamic rocksdb library instead of static one.

However currently the dynamic libraries have to be manually compiled by default, which is inconvenient. I made dymamic libraries to be compiled by default.

Test Plan: make clean; make; make clean;

Reviewers: haobo, sdong, dhruba, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15117
2014-01-14 00:28:10 -08:00
Siying Dong
c4548d5f1f WriteBatch to provide a way for user to query data size directly and only return constant reference of data in Data()
Summary:
WriteBatch::Data() now is easily to be misuse by users. Also, there is no cheap way for user of WriteBatch to know the data size accumulated. This patch fix the problem by:
(1) return a constant reference to Data() so it's obvious to caller what it means.
(2) add a function to return data size directly

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: kailiu

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D15123
2014-01-13 16:52:14 -08:00
Igor Canadi
00065d0d5d Fix merge test
Copy max_successive_merges from options to column family options
2014-01-13 13:27:47 -08:00
Igor Canadi
a107691711 [column families] Implement refcounting ColumnFamilyData
Summary: We don't want to delete ColumnFamilyData object if somebody has references to it.

Test Plan: `make check` for now, but will need to implement bigger column family test case

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15111
2014-01-13 09:22:44 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Igor Canadi
d076cef347 [column families] Get rid of VersionSet::current_ and keep current Version for each column family
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.

I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15105
2014-01-13 08:59:15 -08:00
Igor Canadi
dd6ecdf342 Use ASSERT_EQ() instead of assert() in merge_test 2014-01-11 09:25:47 -08:00
Schalk-Willem Kruger
a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Igor Canadi
62197d28b6 Merge pull request #62 from matope/fix-BackupableDBTest-NoDoubleCopy-test-fail
Fix share_table_files condition in BackupEngine constructor.
2014-01-10 13:40:46 -08:00
ono_matope
f8642dacde Fix share_table_files condition in BackupEngine constructor.
That makes BackupableDBTest.NoDoubleCopy test error.
2014-01-11 05:12:07 +09:00
Kai Liu
9996e2d21c Merge pull request #61 from Yancey1989/master
fix compile warning
2014-01-10 10:28:50 -08:00
Yancey
afdd2d1a46 fix compile warning 2014-01-10 17:56:35 +08:00
Igor Canadi
cb37ddf229 Feature requests for BackupableDB
Summary:
This diff introduces some features that were requested by two internal customers:
* Ability for backups not to share table files, because we can't guarantee that equal filename means equal content accross replicas
* Ability for two threads to call EnableFileDeletions() and DisableFileDeletions()
* Ability to stop backup from another thread and not slow down the DB close
* Copy the files to the temporary folder first and then atomically rename

Test Plan: Added some tests to backupable_db_test

Reviewers: dhruba, sanketh, muthu, sdong, haobo

Reviewed By: haobo

CC: leveldb, sanketh, muthu

Differential Revision: https://reviews.facebook.net/D14769
2014-01-09 12:24:28 -08:00
Igor Canadi
d0406675c2 readwhilewriting benchmark
Summary:
Added readwhilewriting benchmark to our regression tests.
Changed block cache shards from 16 to 64, as Mark found that cache mutex contention is a big bottleneck.

Test Plan: Ran it.

Reviewers: dhruba, haobo, MarkCallaghan, xjin

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15075
2014-01-08 17:44:58 -08:00
Siying Dong
5575316350 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703
2014-01-08 16:05:36 -08:00
kailiu
12b6d2b839 Separate the aligned and unaligned memory allocation
Summary: Use two vectors for different types of memory allocation.

Test Plan: run all unit tests.

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15027
2014-01-08 15:11:42 -08:00
Igor Canadi
19e3ee64ac Add column family information to WAL
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).

This endeavour required changing some WriteBatch internals.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15045
2014-01-08 12:53:33 -08:00