Commit Graph

964 Commits

Author SHA1 Message Date
Igor Canadi
22507aff6c Fix compile issue in Mac OS
Summary:
Compile issues are:
* Unused variable env_
* Unused fallocate_with_keep_size_

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17043
2014-03-19 15:40:12 -07:00
Lei Jin
6dc940d4c9 avoid shared_ptr assignment in Version::Get()
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!

Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s

Reviewers: haobo, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17007
2014-03-19 10:54:32 -07:00
sdong
71e6a34271 Add a DB property to indicate number of background errors encountered
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring

Test Plan: Add a unit test. make all check

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: ljin, nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16959
2014-03-18 14:28:30 -07:00
Igor Canadi
69aa6ecb26 Finalize fist version in column family 2014-03-18 14:23:47 -07:00
Igor Canadi
e25819a185 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
2014-03-18 14:00:20 -07:00
Kai Liu
1ec72b37b1 Several easy-to-add properties related to compaction and flushes
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.

Test Plan: run unit tests and add a new unit test to cover new properties.

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D13677
2014-03-18 14:00:09 -07:00
Igor Canadi
758fa8c359 Don't Finalize in CompactionPicker
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983
2014-03-18 13:59:59 -07:00
Igor Canadi
3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Lei Jin
63cef90078 disable the log_number check in Recover()
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923
2014-03-18 12:46:29 -07:00
Igor Canadi
bcea9c1296 Finalize version in dumpmanifest 2014-03-18 09:45:52 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi
ae25742af9 Fix race condition in manifest roll
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929
2014-03-17 21:50:15 -07:00
Igor Canadi
d63ae5cb59 Adjust memtable sizes in unit test 2014-03-17 18:37:34 -07:00
Igor Canadi
64904b39a0 Merge branch 'master' into columnfamilies
Conflicts:
	utilities/backupable/backupable_db.cc
2014-03-17 17:57:14 -07:00
Igor Canadi
e0c1211555 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	tools/db_stress.cc
2014-03-17 12:21:05 -07:00
Yueh-Hsuan Chiang
a5fafd4f46 Correct the logic of MemTable::ShouldFlushNow().
Summary:
Memtable will now be forced to flush if the one of the following
conditions is met:
1. Already allocated more than write_buffer_size + 60% arena block size.
   (the overflowing condition)
2. Unable to safely allocate one more arena block without hitting the
   overflowing condition AND the unused allocated memory < 25% arena
   block size.

Test Plan: make all check

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16893
2014-03-17 12:20:11 -07:00
sdong
c61c9830d4 Fix a bug that Prev() can hang.
Summary: Prev() now can hang when there is a key with more than max_skipped number of appearance internally but all of them are newer than the sequence ID to seek. Add unit tests to confirm the bug and fix it.

Test Plan: make all check

Reviewers: igor, haobo

Reviewed By: igor

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16899
2014-03-17 10:00:41 -07:00
Igor Canadi
30447b7251 Merge pull request #99 from caiosba/master
Make it compile on Debian/GCC 4.7
2014-03-17 08:51:35 -07:00
Lei Jin
0cf6c8f7ce fix: use the correct edit when comparing log_number
Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.

Test Plan: still running make whitebox_crash_test

Reviewers: igor, haobo, igor2

Reviewed By: igor2

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16911
2014-03-15 23:30:43 -07:00
Lei Jin
453ec52ca1 journal log_number correctly in MANIFEST
Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.

Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16875
2014-03-14 18:36:47 -07:00
Caio SBA
b9c78d2db6 Make it compile on Debian/GCC 4.7 2014-03-14 22:44:35 +00:00
Igor Canadi
a782bb989e Fix log_number in LogAndApply 2014-03-14 13:45:30 -07:00
Igor Canadi
8b169e949a Merge branch 'master' into columnfamilies 2014-03-14 13:42:36 -07:00
Igor Canadi
928ee23567 Change WriteBatch interface 2014-03-14 13:40:06 -07:00
Igor Canadi
2bad3cb0db Missing includes 2014-03-14 13:02:20 -07:00
Igor Canadi
db234133a9 [CF] WriteBatch to take in ColumnFamilyHandle
Summary: Client doesn't need to know anything about ColumnFamily ID. By making WriteBatch take ColumnFamilyHandle as a parameter, we can eliminate method GetID() from ColumnFamilyHandle

Test Plan: column_family_test

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16887
2014-03-14 11:30:14 -07:00
Igor Canadi
3c75cc15a9 Fix HashSkipList and HashLinkedList SIGSEGV
Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.

Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.

Test Plan: SIGSEGV and assertion-throwing test now works!

Reviewers: ljin, haobo, sdong

Reviewed By: sdong

CC: leveldb, ljin

Differential Revision: https://reviews.facebook.net/D16857
2014-03-14 10:02:04 -07:00
Igor Canadi
6c72079d77 Fix warning on Mac OS 2014-03-14 09:54:23 -07:00
Igor Canadi
f0e1e3ebf1 CF cleanup part 2 2014-03-13 14:34:01 -07:00
Igor Canadi
f071a20f6e Need more data in memtable to flush due to 11da8b 2014-03-13 13:52:20 -07:00
Igor Canadi
e1f56e12cf Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	tools/db_stress.cc
2014-03-13 13:21:20 -07:00
sdong
5aa81f04fa Fix extra compaction tasks scheduled after D16767 in some cases
Summary:
With D16767, there is a case compaction tasks are scheduled infinitely:
(1) no flush thread is configured and more than 1 compaction threads
(2) a flush is going on by one compaction hread
(3) the state of SST files is in the state that versions_->current()->NeedsCompaction() will generate a false positive (return true actually there is no work to be done)
In that case, a infinite loop will be formed.

This patch would fix it.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16863
2014-03-13 13:06:08 -07:00
Kai Liu
11da8bc5df A heuristic way to check if a memtable is full
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.

Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051
2014-03-12 16:40:14 -07:00
Igor Canadi
25c8a1a20f More bug fixed introduced by code cleanup 2014-03-12 12:28:23 -07:00
Igor Canadi
b5d6ad69fc Bug fixes introduced by code cleanup 2014-03-12 11:10:26 -07:00
Igor Canadi
dff9214165 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	tools/db_stress.cc
2014-03-12 10:17:41 -07:00
Igor Canadi
fb2346fc1f [CF] Code cleanup part 1
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.

Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.

This diff also fixes race condition when dropping column family.

Test Plan: Ran db_stress with variety of parameters

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D16833
2014-03-12 09:56:53 -07:00
Igor Canadi
45ad75db80 Correct version of D16821 2014-03-12 09:38:59 -07:00
Igor Canadi
2b95dc1542 Revert "Fix bad merge of D16791 and D16767"
This reverts commit 839c8ecfcd.
2014-03-12 09:37:43 -07:00
sdong
839c8ecfcd Fix bad merge of D16791 and D16767
Summary: A bad Auto-Merge caused log buffer is flushed twice. Remove the unintended one.

Test Plan: Should already be tested (the code looks the same as when I ran unit tests).

Reviewers: haobo, igor

Reviewed By: haobo

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16821
2014-03-11 21:31:57 -07:00
sdong
bd45633b71 Fix data race against logging data structure because of LogBuffer
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.

With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.

Thanks @igor for this finding!

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: dhruba, ljin, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D16767
2014-03-11 16:09:53 -07:00
Igor Canadi
d833f15738 Fix bug in VersionEdit::DebugString() 2014-03-11 12:14:09 -07:00
Igor Canadi
37472bb279 Add MaxColumnFamily to VersionEdit::DebugString() 2014-03-11 12:08:42 -07:00
Igor Canadi
457c78eb89 [CF] db_stress for column families
Summary:
I had this diff for a while to test column families implementation. Last night, I ran it sucessfully for 10 hours with the command:

   time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1  --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress

It is ready to be committed :)

Test Plan: Ran it for 10 hours

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16797
2014-03-11 12:06:12 -07:00
sdong
6c66bc08d9 Temp Fix of LogBuffer flushing
Summary: To temp fix the log buffer flushing. Flush the buffer inside the lock. Clean the trunk before we find an eventual fix.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: ljin, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D16791
2014-03-11 11:37:40 -07:00
Igor Canadi
cb9802168f Add a comment after SignalAll()
Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.

Test Plan: no test

Reviewers: sdong, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16785
2014-03-11 11:27:19 -07:00
Igor Canadi
dad8603fc4 [CF] Fix column family dropping
Summary: Column family should be dropped after the change has been commited

Test Plan: db stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16779
2014-03-11 09:47:29 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Igor Canadi
d5de22dc09 Call PurgeObsoleteFiles() only when HaveSomethingToDelete()
Summary: as title

Test Plan: fixed the build failure http://ci-builds.fb.com/job/rocksdb_build/987/console

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16743
2014-03-10 15:42:14 -07:00
sdong
fac58c0504 DBTest: remove perf_context's time > 0 check
Summary: DBTest checks  perf_context.seek_internal_seek_time > 0 and perf_context.find_next_user_entry_time > 0, which is not reliable. Remove them.

Test Plan: ./db_test

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16737
2014-03-10 14:24:56 -07:00
Haobo Xu
a91aed615a [RocksDB] Minor cleanup of PurgeObsoleteFiles
Summary: as title. also made info log output of file deletion a bit more descriptive.

Test Plan: make check; db_bench and look at LOG output

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16731
2014-03-10 14:13:38 -07:00
Lei Jin
8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Haobo Xu
9e0e6aa7f6 [RocksDB] make sure KSVObsolete does not get accessed as a valid pointer.
Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.

Test Plan: make check

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16701
2014-03-10 12:55:25 -07:00
Haobo Xu
66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
Igor Canadi
04d2c26e17 Add option verify_checksums_in_compaction
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695
2014-03-10 10:06:34 -07:00
Igor Canadi
d4f2c610d3 Ignore dropped column families -- don't flush or compact them 2014-03-07 18:43:21 -08:00
Igor Canadi
1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Igor Canadi
9f15092ebd [CF] NewIterators
Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation

Test Plan: added a unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16689
2014-03-07 16:15:25 -08:00
Lei Jin
e5fa4944fc use CAS when returning SuperVersion to ThreadLocal
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641
2014-03-07 14:43:22 -08:00
Igor Canadi
eec8695206 Delete local sv when destroying DB from stress test
Summary: Not deleting local SV caused some an crash test issue: http://ci-builds.fb.com/job/rocksdb_asan_crash_test/83/console

Test Plan: ran unit tests

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16635
2014-03-06 18:15:26 -08:00
Igor Canadi
80a207fc90 Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/compaction_picker.h
	db/db_impl.cc
	db/version_set.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-05 16:59:22 -08:00
sdong
ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
Igor Canadi
e2dd148a8b Fix compile fail introduced by merge 2014-03-05 12:47:44 -08:00
Igor Canadi
a329dd1b25 Fix TEST_Destroy_DBImpl() to work with column families 2014-03-05 12:27:39 -08:00
Igor Canadi
0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
Igor Canadi
9625acbf70 [CF] Dont reuse dropped column family IDs
Summary:
Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest.

Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine.

Test Plan: added a test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16581
2014-03-05 12:13:44 -08:00
Igor Canadi
8ca30bd51b Merge pull request #47 from mlin/kCompactionStopStyleSimilarSize
An initial implementation of kCompactionStopStyleSimilarSize for universal compaction
2014-03-05 10:35:30 -08:00
Lei Jin
04298f8c33 output perf_context in db_bench readrandom
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575
2014-03-05 10:32:54 -08:00
Lei Jin
64138b5d9c fix db_bench to use HashSkipList for real
Summary:
For HashSkipList case, DBImpl has sanity check to see if prefix_extractor in
options is the same as the one in memtable factory. If not, it falls
back to SkipList. As result, I was experimenting with SkipList
performance. No wonder it is much worse than LinkedList

Test Plan: ran benchmark

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16569
2014-03-05 10:28:53 -08:00
Lei Jin
51560ba755 config max_background_flushes in db_bench
Summary: as title

Test Plan: make release

Reviewers: haobo, sdong, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16437
2014-03-05 10:27:17 -08:00
Igor Canadi
c0ccf43648 MergingIterator assertion
Summary: I wrote a test that triggers assertion in MergingIterator. I have not touched that code ever, so I'm looking for somebody with good understanding of the MergingIterator code to fix this. The solution is probably a one-liner. Let me know if you're willing to take a look.

Test Plan: This test fails with an assertion `use_heap_ == false`

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16521
2014-03-05 09:13:07 -08:00
sdong
e8ecca9e86 CleanupIteratorState() only to initialize DeletionState when super version cleanup needed
Summary:
Two changes:
1. DeletionState is only constructed when cleaning up is needed
2. Fix the bug of deletion state construction bug. A change was made in a previous patch: https://reviews.facebook.net/rROCKSDB774ed89c2405ee058086b099cbc8b29e243739cc#71a34e2e However, it somehow got lost when merging

Test Plan: make all check

Reviewers: kailiu, haobo, igor

Reviewed By: igor

CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16233
2014-03-04 20:58:20 -08:00
Igor Canadi
e21d5b8bbc [CF] Flush all memtables on column family drop
Summary: When column family is dropped, we want to delete all WALs that refer to it. To do that, we need to make them obsolete by flushing all the memtables

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16557
2014-03-04 17:21:30 -08:00
Lei Jin
a5b1d2f146 make key evenly distributed between 0 and FLAGS_num
Summary:
The issue is that when FLAGS_num is small, the leading bytes of the key
are padded with 0s. This makes all keys have the same prefix 00000000

Most of the changes are just to make lint happy

Test Plan: ran db_bench

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16317
2014-03-04 17:08:05 -08:00
Igor Canadi
fa34697237 Merge branch 'master' into columnfamilies 2014-03-04 09:39:14 -08:00
Igor Canadi
335b207974 [CF] Delete SuperVersion in a special function
Summary: Added a function DeleteSuperVersion that can be called in DBImpl destructor before PurgingObsoleteFiles. That way, PurgeObsoleteFiles will be able to delete all files held by alive super versions.

Test Plan: column_family_test with valgrind

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16545
2014-03-04 09:35:44 -08:00
kailiu
906f3dca72 Add a hash-index component for block
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245
2014-03-03 21:11:49 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
Igor Canadi
5142b37000 Fix a group commit bug in LogAndApply
Summary:
EncodeTo(&record) does not overwrite, it appends to it.

This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3

I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.

This bug turned up in column family work, where opening a database failed with "adding a same column family twice".

Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16461
2014-03-03 17:10:43 -08:00
Igor Canadi
f9b2f0ad79 [CF] Fix CF bugs in WriteBatch
Summary:
This diff fixes two bugs:
* Increase sequence number even if WriteBatch fails. This is important because WriteBatches in WAL logs have implictly increasing sequence number, even if one update in a write batch fails. This caused some writes to get lost in my CF stress testing
* Tolerate 'invalid column family' errors on recovery. When a column family is dropped, processing WAL logs can have some WriteBatches that still refer to the dropped column family. In recovery environment, we want to ignore those errors. In client's Write() code path, however, we want to return the failure to the client if he's trying to add data to invalid column family.

Test Plan: db_stress's verification works now

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16533
2014-03-03 17:07:46 -08:00
kailiu
bf86af5174 Remove the terrible hack in for flush_block_policy_factory
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503
2014-02-28 16:39:27 -08:00
Igor Canadi
8ea21a778b [CF] Rething LogAndApply for column families
Summary:
I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case.

This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation

(1) and (2) don't support group commit yet.

There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure).

Test Plan: db_stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16491
2014-02-28 14:46:48 -08:00
Igor Canadi
58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Igor Canadi
12966ec1bb Fix LogAndApply() group commit 2014-02-28 12:22:45 -08:00
Yueh-Hsuan Chiang
a77527f2af Add ReadOptions to TransactionLogIterator.
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.

Test Plan: Tests are done off-line and will not be included in the regular unit test.

Reviewers: igor

Reviewed By: igor

CC: leveldb, xjin, dhruba

Differential Revision: https://reviews.facebook.net/D16305
2014-02-28 11:50:36 -08:00
Igor Canadi
f6a257b6a1 Set dropped column family before persisting in the manifest 2014-02-28 11:49:32 -08:00
Igor Canadi
670f3ba212 [CF] Small refactor of Recover() and DumpManifest() 2014-02-28 11:25:38 -08:00
Igor Canadi
099ad94306 Set log number for column family 2014-02-28 11:08:24 -08:00
Igor Canadi
510f84b686 [CF] CreateColumnFamily fix
Summary:
This fixes few bugs with CreateColumnFamily
* We first have to LogAndApply and then call VersionSet::CreateColumnFamily. Otherwise, WriteSnapshot might be invoked, writing out column family add inside of LogAndApply, even though it's not really committed
* Fix LogAndApplyHelper() to not apply log number to column_family_data, which is in case of column family add, just a dummy (default) column family
* Create SuperVerion when creating column family

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16443
2014-02-28 10:40:52 -08:00
Igor Canadi
206b38f31c SetLogNumber in CreateColumnFamily 2014-02-27 16:53:45 -08:00
Igor Canadi
b41a3bc4da [CF] Change flow of CreateColumnFamily
Summary:
Previously, we first wrote to the manifest and then created internal data structure.
Now, we first create internal data structure. That way, we can write out internal comparator to the manifest

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16425
2014-02-27 16:49:49 -08:00
Igor Canadi
492c9f71c6 [CF] Column family support for LDB tool
Summary: Added list_column_family command and also updated dump_manifest

Test Plan: no

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16419
2014-02-27 16:39:23 -08:00
Lei Jin
ad0c3747cb cache SuperVersion in thread local storage to avoid mutex lock
Summary: as title

Test Plan:
asan_check
will post results later

Reviewers: haobo, igor, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16257
2014-02-27 11:38:55 -08:00
Igor Canadi
85b1b5e1b9 [CF] WaitForFlush() instead of sleeping
Summary: If we sleep for 300ms the test fails in valgrind because it takes more than 300ms to flush. This way we WaitForFlush() when we're expecting flush, but still sleep and check if the flush happens even though it's not supposed to.

Test Plan: notest

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16401
2014-02-27 10:31:05 -08:00
Igor Canadi
4c42201204 [CF] Test fixes and speedup 2014-02-26 17:34:39 -08:00
Igor Canadi
343c32be7b [CF] DifferentMergeOperators and DifferentCompactionStyles tests
Summary:
Two new column family tests:
* DifferentMergeOperators -- three column families, one without merge operator, one with add operator and one with append operator. verify that operations work as expected.
* DifferentCompactionStyles -- three column families, two with level compactions and one with universal compaction. trigger the compactions and verify they work as expected.

Test Plan: nope

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16377
2014-02-26 16:05:24 -08:00
Igor Canadi
3c81546422 [CF] Make LogDeletionTest less flakey
Summary: Retry GetSortedWalFiles() and also wait 20ms before counting number of log files. WaitForFlush() doesn't necessarily wait for logs to be deleted, since logs are deleted outside of the mutex.

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16371
2014-02-26 14:41:18 -08:00
Igor Canadi
6e7cae7711 [CF] More tests
Summary: New unit tests for column families

Test Plan: this is a test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16359
2014-02-26 14:16:23 -08:00
Igor Canadi
9bce2b2a84 [CF] Fix lint errors in CF code
Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16347
2014-02-26 10:10:00 -08:00
Igor Canadi
8b7ab9951c [CF] Handle failure in WriteBatch::Handler
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.

Test Plan: Added test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16323
2014-02-26 10:10:00 -08:00
Igor Canadi
8895526308 Merge branch 'master' into columnfamilies 2014-02-25 17:04:48 -08:00
Igor Canadi
5ad7ee03ea [CF] Log deletion in column families
Summary:
* Added unit test that verifies that obsolete files are deleted.
* Advance log number for empty column family when cutting log file.
* MinLogNumber() bug fix! (caught by the new unit test)

Test Plan: unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16311
2014-02-25 16:54:41 -08:00
Igor Canadi
dc277f0ab7 [CF] Adaptation of GetLiveFiles for CF
Summary: Even if user flushes the memtables before getting live files, we still can't guarantee that new data didn't come in (to already-flushed memtables). If we want backups to provide consistent view of the database, we still need to get WAL files.

Test Plan: backupable_db_test

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16299
2014-02-25 13:21:14 -08:00
Igor Canadi
5a91746277 log file is uint64_t 2014-02-25 12:57:43 -08:00
Igor Canadi
4209516359 Schedule flush when waiting on flush
Summary:
This will also help with avoiding the deadlock. If a flush failed and we're waiting for a memtable to be flushed, we should schedule a new flush and hope a new one succeedes.

If paranoid_checks = false, Wait() will still hang on ENOSPC, but at least it will automatically continue when the space frees up. Current behavior both hangs and deadlocks.

Also, I renamed some 'compaction' to 'flush'. 'compaction' was leveldb way of saying things.

Test Plan: make check

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16281
2014-02-25 12:04:14 -08:00
Lei Jin
dea894ef8d expose wal_dir in db_bench
Summary: as title

Test Plan: ran db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16269
2014-02-25 10:43:46 -08:00
Albert Strasheim
72aacf6b96 A few more C API functions. 2014-02-25 10:32:28 -08:00
Igor Canadi
b69e7d99d5 [CF] Better handling of memtable logs
Summary: DBImpl now keeps a list of alive_log_files_. On every FindObsoleteFiles, it deletes all alive log files that are smaller than versions_->MinLogNumber()

Test Plan:
make check passes
no specific unit tests yet, will add

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16293
2014-02-25 09:55:13 -08:00
Igor Canadi
d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00
Igor Canadi
6ed450a58c DeleteFile should schedule Flush or Compaction
Summary:
More info here: https://github.com/facebook/rocksdb/issues/89
If flush fails because of ENOSPC, we have a deadlock problem. This is a quick fix that will continue the normal operation when user deletes the file and frees up the space on the device.

We need to address the issue more broadly with bg_error_ cleanup.

Test Plan: make check

Reviewers: dhruba, haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16275
2014-02-24 16:00:13 -08:00
Igor Canadi
2bf1151a25 Fix C API 2014-02-24 15:15:34 -08:00
Igor Canadi
18a7cdfba0 Merge pull request #82 from tecbot/api-enhancements
Enhancements to the API
2014-02-24 14:20:13 -08:00
Thomas Adam
ce2b1f7b44 added a test case for custom merge operator 2014-02-23 17:58:38 +01:00
Thomas Adam
68248a2ac5 added a delete method for custom filter policy and merge operator to make it possible to override the cleanup behaviour of the return value 2014-02-23 17:58:11 +01:00
sdong
b2d29675c8 Add a test in prefix_test to verify correctness of results
Summary:
Add a test to verify HashLinkList and HashSkipList (mainly for the former one) returns the correct results when inserting the same bucket in the different orders.

Some other changes:
(1) add the test to test list
(2) fix compile error
(3) add header

Test Plan: ./prefix_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: igor, yhchiang, i.am.jin.lei, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16143
2014-02-19 17:00:34 -08:00
Kai Liu
2b205b35d8 Disable putting filter block to block cache
Summary: This bug caused server crash issues because the filter block is too big and kept purging out of cache.

Test Plan: Wrote a new unit tests to make sure it works.

Reviewers: dhruba, haobo, igor, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16221
2014-02-19 15:38:57 -08:00
Thomas Adam
d74c9b79ea Enhancements to the API 2014-02-19 23:59:54 +01:00
sdong
e90d3f7752 First Transaction Logs Should Not Skip Storage Options Given
Summary: Currently, the first transaction log file ignore bytes_per_sync and other storage-related options. It is not consistent. Fix it.

Test Plan: make all check. See the options set in GDB.

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: igor, ljin, yhchiang, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16215
2014-02-19 10:58:39 -08:00
Igor Canadi
6aef661230 some improvements to CompressedCache test 2014-02-14 17:47:53 -08:00
Igor Canadi
422bb09cb0 Fix table properties
Summary: Adapt table properties to column family world

Test Plan: make check

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16161
2014-02-14 17:13:10 -08:00
Igor Canadi
76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
Igor Canadi
be7e273d83 fix u/s comparison #83 2014-02-14 16:18:55 -08:00
Igor Canadi
c67d48c852 [CF] DB test to run on non-default column family
Summary:
This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far.

One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families.

Test Plan: This is a test!

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16149
2014-02-14 16:08:59 -08:00
kailiu
63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Kai Liu
b2e7ee8b41 Followup code refactor on plain table
Summary:
Fixed most comments in https://reviews.facebook.net/D15429.
Still have some remaining comments left.

Test Plan: make all check

Reviewers: sdong, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15885
2014-02-13 15:27:59 -08:00
Siying Dong
f3ae3d07cc Add more black-box tests for PlainTable and explicitly support total order mode
Summary:
1. Add some more implementation-aware tests for PlainTable
2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0  means binary search only. Also fixes some divide 0 risks.
3. Explicitly support total order (only use binary search)
4. some code cleaning up.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16023
2014-02-12 17:37:22 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
Igor Canadi
ca5f1a225a CompactionContext to include is_manual_compaction
Summary: Added a bit more information to compaction context, requested by internal team at FB.

Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16095
2014-02-12 12:24:18 -08:00
Lei Jin
994c327b86 IOError cleanup
Summary: Clean up IOErrors so that it only indicates errors talking to device.

Test Plan: make all check

Reviewers: igor, haobo, dhruba, emayanke

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15831
2014-02-12 11:42:54 -08:00
Lei Jin
5fbf2ef42d preload table handle on Recover() when max_open_files == -1
Summary: This covers existing table files before DB open happens and avoids contention on table cache

Test Plan: db_test

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16089
2014-02-12 10:43:27 -08:00
Lei Jin
28b7f7faa8 enable plain table in db_bench
Summary: as title

Test Plan: ran db_bench to gather stats

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16059
2014-02-12 10:41:55 -08:00
kailiu
265150cb49 Fix problem 3 for issue #80 2014-02-11 17:52:18 -08:00
Siying Dong
33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Kai Liu
d4b789fdee Add LIBRARY back to make dbg 2014-02-10 20:15:09 -08:00
Igor Canadi
8e634d3ea4 Merge pull request #74 from alberts/lz4
Support for LZ4 compression.
2014-02-10 15:46:56 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
kailiu
161ab42a8a Make table properties shareable
Summary:
We are going to expose properties of all tables to end users through "some" db interface.
However, current design doesn't naturally fit for this need, which is because:

1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version).
2. Copy table properties is OK, but it's slow.

Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce.

Test Plan: `make check` passed

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15999
2014-02-07 19:26:49 -08:00
Igor Canadi
8d4db63a2d [CF] OpenWithColumnFamilies -> Open
Summary: By discussion with @dhruba, overloading Open makes more sense

Test Plan: compiles!

Reviewers: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16017
2014-02-07 14:49:33 -08:00
Igor Canadi
99e61fdd5c [CF] Separate dumping of DBOptions and ColumnFamilyOptions
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.

Test Plan: make check, confirm contents of the LOG

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16011
2014-02-07 13:52:51 -08:00
Igor Canadi
2b8c44639a Merge branch 'master' into columnfamilies 2014-02-07 11:42:56 -08:00
Igor Canadi
1560bb913e Readrandom with tailing iterator
Summary:
Added an option for readrandom benchmark to run with tailing iterator instead of Get. Benefit of tailing iterator is that it doesn't require locking DB mutex on access.

I also have some results when running on my machine. The results highly depend on number of cache shards. With our current benchmark setting of 4 table cache shards and 6 block cache shards, I don't see much improvements of using tailing iterator. In that case, we're probably seeing cache mutex contention.

Here are the results for different number of shards

    cache shards       tailing iterator        get
       6                      1.38M           1.16M
      10                      1.58M           1.15M

As soon as we get rid of cache mutex contention, we're seeing big improvements in using tailing iterator vs. ordinary get.

Test Plan: ran regression test

Reviewers: dhruba, haobo, ljin, kailiu, sding

Reviewed By: haobo

CC: tnovak

Differential Revision: https://reviews.facebook.net/D15867
2014-02-07 09:47:47 -08:00
Igor Canadi
0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi
0b4ccf765c Flushes should always go to HIGH priority thread pool
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).

It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15987
2014-02-06 14:17:13 -08:00
Igor Canadi
f8d5443efe [CF] Thread-safety guarantees for ColumnFamilySet
Summary: Revised thread-safety guarantees and implemented a way to spinlock the object.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15975
2014-02-06 11:57:36 -08:00
Igor Canadi
8fa8a708ef [CF] Propagate correct options to WriteBatch::InsertInto
Summary:
WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family.

This required a bit more acrobatics since lots of interfaces had to be changed.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15957
2014-02-06 10:23:31 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Igor Canadi
b4f441f48a Fixed a bug introduced by previous commit 2014-02-05 14:58:24 -08:00
Igor Canadi
f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi
6e56ab5702 [CF] Add full_options_ to ColumnFamilyData
Summary:
Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack.

The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family)

Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15927
2014-02-05 12:26:40 -08:00
Igor Canadi
328ac7ee02 Merge branch 'master' into columnfamilies 2014-02-05 12:00:33 -08:00
Igor Canadi
c24d8c4e90 [CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.

To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.

This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 11:55:30 -08:00
Igor Canadi
7b9f134959 [CF] Move InternalStats to ColumnFamilyData
Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15879
2014-02-04 17:59:50 -08:00
Igor Canadi
73f62255c1 [CF] Split SanitizeOptions into two
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)

I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.

Test Plan: make check doesn't complain

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15873
2014-02-04 17:26:51 -08:00
kailiu
d43ebd8c65 Put table factory back to public api
Summary:
Previous I am too ambitious to hide every detail about table factory
to internal api. However, we cannot pass the compilatoin for external
users since we use table factory as the shared_ptr, which requires
the definition of table factory's destructor.

Test Plan: make check;

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15861
2014-02-03 19:51:20 -08:00
Igor Canadi
5e2c4fe766 Get rid of DBImpl::user_comparator()
Summary: user_comparator() is a Column Family property, not DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15855
2014-02-03 17:50:47 -08:00
Igor Canadi
0e22badc08 [column families] Iterator and MultiGet
Summary: Support for different column families in Iterator and MultiGet code path.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15849
2014-02-03 17:44:40 -08:00
Igor Canadi
2966d764cd Fix some 32-bit compile errors
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697

Test Plan: RocksDB still compiles on 64-bit :)

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15825
2014-02-03 13:48:30 -08:00
Igor Canadi
2a9271b403 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.cc
2014-02-03 13:47:54 -08:00
Lei Jin
5b3b6549d6 use super_version in NewIterator() and MultiGet() function
Summary:
Use super_version insider NewIterator to avoid Ref() each component
separately under mutex
The new added bench shows NewIterator QPS increases from 515K to 719K
No meaningful improvement for multiget I guess due to its relatively small
cost comparing to 90 keys fetch in the test.

Test Plan: unit test and db_bench

Reviewers: igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D15609
2014-02-03 13:13:36 -08:00
Igor Canadi
29bacb2eb6 VersionSet cleanup
Summary:
Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance)
Unfriended VersionSet and ColumnFamilyData (yay!)
Removed VersionSet::NumberLevels()
Cleaned up DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15819
2014-02-03 13:10:47 -08:00
Siying Dong
d169b67680 [Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte.
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489
2014-02-03 12:19:30 -08:00
Igor Canadi
5c6ef56152 Fix printf format 2014-02-03 10:25:37 -08:00
kailiu
4f6cb17bdb First phase API clean up
Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525
2014-02-03 00:30:43 -08:00
Igor Canadi
27a8856c23 Compacting column families
Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange()

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15813
2014-01-31 19:54:03 -08:00
Igor Canadi
5661ed8b80 Fix reduce_levels_test 2014-01-31 19:44:48 -08:00
Igor Canadi
fe9bd300d6 Merge branch 'master' into columnfamilies 2014-01-31 17:17:22 -08:00
Dhruba Borthakur
30a700657d Fix corruption_test failure caused by auto-enablement of checksum verification.
Summary:
Patch  https://reviews.facebook.net/D15591 enabled checksum
verification by default. This caused the unit test to fail.

Test Plan: ./corruption_test

Reviewers: igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15795
2014-01-31 17:16:38 -08:00
Igor Canadi
f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
Igor Canadi
5693db2a02 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-01-31 14:45:15 -08:00
Igor Canadi
dbbffbd772 Mark the log_number file number used
Summary:
VersionSet::next_file_number_ is always assumed to be strictly greater than VersionSet::log_number_. In our new recovery code, we artificially set log_number_  to be (log_number + 1), so that once we flush, we don't recover from the same log file again (this is important because of merge operator non-idempotence)

When we set VersionSet::log_number_ to (log_number + 1), we also have to mark that file number used, such that next_file_number_ is increased to a legal level. Otherwise, VersionSet might assert.

This has not be a problem so far because here's what happens:
1. assume next_file_number is 5, we're recovering log_number 10
2. in DBImpl::Recover() we call MarkFileNumberUsed with 10. This will set VersionSet::next_file_number_ to 11.
3. If there are some updates, we will call WriteTable0ForRecovery(), which will use file number 11 as a new table file and advance VersionSet::next_file_number_ to 12.
4. When we LogAndApply() with log_number 11, assertion is true: assert(11 <= 12);

However, this was a lucky occurrence. Even though this diff doesn't cause a bug, I think the issue is important to fix.

Test Plan: In column families I have different recovery logic and this code path asserted. When adding MarkFileNumberUsed(log_number + 1) assert is gone.

Reviewers: dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15783
2014-01-31 14:43:16 -08:00
Igor Canadi
3615f534d1 Enable flushing memtables from arbitrary column families
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!

Test Plan: Added a new unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15789
2014-01-31 14:42:52 -08:00
Siying Dong
56bea9f80d When using Universal Compaction, Zero out seqID in the last file too
Summary: I didn't figure out the reason why the feature of zeroing out earlier sequence ID is disabled in universal compaction. I do see bottommost_level is set correctly. It should simply work if we remove the constraint of universal compaction.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15423
2014-01-31 09:59:52 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Dhruba Borthakur
abd70ecc2b The default settings enable checksum verification on every read.
Summary: The default settings enable checksum verification on every read.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15591
2014-01-30 19:14:03 -08:00
Igor Canadi
9ca638a86d Enable iterating column families with a concurrent writer
Summary:
Sometimes we iterate through column families, and unlock the mutex in the body of the iteration. While mutex is unlocked, some column family might be created or dropped. We need to be able to continue iterating through column families even though our current column family got dropped.

This diff implements circular linked lists that connect all column families. It then uses the link list to enable iterating through linked lists. Even if the column family is dropped, its next_ pointer still can be used to advance to another alive column family.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15603
2014-01-30 16:57:52 -08:00
Igor Canadi
6973bb1722 MakeRoomForWrite() support for column families
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.

Test Plan: make check does not complain

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15597
2014-01-30 16:12:08 -08:00
Igor Canadi
c37e7de669 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-01-30 11:47:23 -08:00
Igor Canadi
ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
Igor Canadi
3c0dcf0e25 InternalStatistics
Summary:
In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste.

Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class.

Test Plan: make check

Reviewers: dhruba, kailiu, haobo, sdong, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15273
2014-01-29 20:40:41 -08:00
Lei Jin
d118707f8d set bg_error_ when background flush goes wrong
Summary: as title

Test Plan: unit test

Reviewers: haobo, igor, sdong, kailiu, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15435
2014-01-29 15:55:58 -08:00
Igor Canadi
514e42c7cc Fix some lint warnings 2014-01-29 15:27:27 -08:00
Igor Canadi
fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Igor Canadi
15999e728e Fix column family test (create directory) 2014-01-29 14:06:59 -08:00
Igor Canadi
4662969bf5 PurgeObsoleteFiles in DropColumnFamily
Summary: When we drop the column family, we want to delete all the files from that column family.

Test Plan: make check

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15561
2014-01-29 11:58:33 -08:00
Igor Canadi
20b231d712 Merge branch 'master' into columnfamilies 2014-01-29 11:47:43 -08:00
Igor Canadi
f24a3ee52d Read from and write to different column families
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()

Test Plan: Added a unit test in column_family_test

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15537
2014-01-29 11:38:16 -08:00
Igor Canadi
e57f0cc1a1 add include <atomic> to version_set.h 2014-01-29 08:17:43 -08:00
Igor Canadi
c1071ed95c Merge branch 'master' into columnfamilies 2014-01-28 16:04:00 -08:00
Igor Canadi
5d2c62822e Only get the manifest file size if there is no error
Summary:
I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry.

@kailiu, can we get this in release, too?

Test Plan: make check

Reviewers: kailiu, dhruba, haobo

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D15513
2014-01-28 16:02:51 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Mark Callaghan
90f29ccbef Update monitoring to include average time per compaction and stall
Summary:
The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds.
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count   msComp   msStall  Ln-stall Stall-cnt
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        8       15   1.5         2         0        30         0         0        30        0.0       0.0        15.5        0        0        0        0       16      112       0.2       1.3      7568
  1        8       16   1.6         1        26        26        15        11        16        3.5      17.6        18.1        8        6       13        7        3      362       0.0       0.0         0
  2        1        2   0.0         0         0         2         0         0         2        0.0       0.0        18.4        0        0        0        0        1       50       0.0       0.0         0

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15345
2014-01-27 15:06:04 -08:00
Schalk-Willem Kruger
3d33da75ef Fix UnmarkEOF for partial blocks
Summary:
Blocks in the transaction log are a fixed size, but the last block in the transaction log file is usually a partial block. When a new record is added after the reader hit the end of the file, a new physical record will be appended to the last block. ReadPhysicalRecord can only read full blocks and assumes that the file position indicator is aligned to the start of a block. If the reader is forced to read further by simply clearing the EOF flag, ReadPhysicalRecord will read a full block starting from somewhere in the middle of a real block, causing it to lose alignment and to have a partial physical record at the end of the read buffer. This will result in length mismatches and checksum failures. When the log file is tailed for replication this will cause the log iterator to become invalid, necessitating the creation of a new iterator which will have to read the log file from scratch.

This diff fixes this issue by reading the remaining portion of the last block we read from. This is done when the reader is forced to read further (UnmarkEOF is called).

Test Plan:
- Added unit tests
- Stress test (with replication). Check dbdir/LOG file for corruptions.
- Test on test tier

Reviewers: emayanke, haobo, dhruba

Reviewed By: haobo

CC: vamsi, sheki, dhruba, kailiu, igor

Differential Revision: https://reviews.facebook.net/D15249
2014-01-27 14:49:10 -08:00
Igor Canadi
511b03a5b5 LogAndApply to take ColumnFamilyData
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.

Test Plan: make check

Reviewers: dhruba, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15465
2014-01-27 13:57:58 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Siying Dong
b20486f294 [Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15393
2014-01-27 10:26:14 -08:00
Igor Canadi
cf783c674a Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.h
2014-01-27 10:00:24 -08:00