Summary:
clean up the db_bench a little bit. also avoid allocating memory for key
in the loop
Test Plan:
I verified a run with filluniquerandom & readrandom. Iterator seek will be used lot
to measure performance. Will fix whatever comes up
Reviewers: haobo, igor, yhchiang
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17559
Summary: ToString() is expensive. Profiling shows that most compaction threads are stuck in jemalloc, allocating a new string. This will help out a litte.
Test Plan: make check
Reviewers: haobo, danguo
Reviewed By: danguo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17583
Summary: This will allow us to disable them completely for iOS or for better performance
Test Plan: will run make all check
Reviewers: igor, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17511
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.
This patch improves iterator performance significantly. Running this benchmark:
./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond
The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.
Test Plan:
Add a unit test.
make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17547
Summary: If a client specifies wal_dir with trailing '/', we will fail in deleting obsolete log files. See task #4083746
Test Plan: make check
Reviewers: haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17535
Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex.
Test Plan: make all check
Reviewers: haobo, ljin, igor
Reviewed By: haobo
CC: nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17487
Summary: per sdong's request, this will help processor prefetch on n->key case.
Test Plan: make all check
Reviewers: sdong, haobo, igor
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17415
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.
Test Plan: Should be safe as long as several related db test passes
Reviewers: haobo, igor, ljin
Reviewed By: igor
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17493
Summary: Move several some common logging still in DB mutex to log buffer.
Test Plan: make all check
Reviewers: haobo, igor, ljin, nkg-
Reviewed By: nkg-
CC: nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17439
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.
Test Plan: TransactionLogIteratorRace; make check
Reviewers: dhruba, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17121
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: igor
CC: yhchiang, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17427
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.
Test Plan: make all check
Reviewers: haobo, yhchiang, igor
CC: danguo, dhruba, ljin, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17223
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead
Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check
Reviewers: igor, haobo, ljin
Reviewed By: igor
CC: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17409
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: dhruba, igor, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17289
Summary:
In total order mode, iterator's seek() shouldn't check total order.
Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.
This bug was reported by @igor.
Test Plan: test plain_table_db_test
Reviewers: ljin, haobo, igor
Reviewed By: igor
CC: yhchiang, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17391
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.
Test Plan: make check
Reviewers: haobo, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17343
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?
Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: dhruba
CC: leveldb, danguo
Differential Revision: https://reviews.facebook.net/D17337
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.
Test Plan: make all check
Reviewers: dhruba, haobo, ljin, igor, yhchiang
Reviewed By: igor
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D16995
Summary: In one of the perf, I shows 10%-25% CPU costs of MemTableIterator.Seek(), when doing dynamic prefix seek, are spent on checking whether we need to do bloom filter check or finding out the prefix extractor. Seems that more level of pointer checking makes CPU cache miss more likely. This patch makes things slightly simpler by copying pointer of bloom of prefix extractor into the iterator.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: ljin
CC: igor, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17247
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.
Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study
Test Plan: benchmarked this change substantially. Will run make all check as well
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17133
Summary: Fix the bug in MergeUtil which causes mixing values of different keys.
Test Plan:
stringappend_test
make all check
Reviewers: haobo, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17235
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6
Also included in the diff some code cleanup
Test Plan:
valgrind env_test
also make check
Reviewers: haobo, danguo, yhchiang
Reviewed By: danguo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17211
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).
In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.
I also added a verification that all files are sorted correctly in CheckConsistency()
NOTE: This diff depends on D17037
Test Plan: make check. Will also run db_stress
Reviewers: dhruba, haobo, sdong, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17049
Summary:
AssertHeld() was a no-op before. Now it does things.
Also, this change caught a bad bug in SuperVersion::Init(). The method is calling db->mutex.AssertHeld(), but db variable is not initialized yet! I also fixed that issue.
Test Plan: make check
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17193
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it
Test Plan: make check
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17163
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.
Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.
Reviewers: haobo, igor, dhruba, ljin, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17001
Summary:
Add assert(operands_list.size() >= 2) in MergeOperator::PartialMergeMulti
to ensure it's only be called when we have at least two merge operands.
Test Plan: run merge_test and stringappend_test.
Reviewers: haobo, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17169
Summary:
Currently if client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff added a bool to indicate if the prefix
has been intialized. I also added a unit test to cover this case and
make sure the new code path is hit.
Test Plan: db_test
Reviewers: igor, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17151
Summary:
Fix a bug that PartialMergeMulti will try to merge the first operand
with an empty slice.
Test Plan: run stringappend_test and merge_test.
Reviewers: haobo, igor
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17157
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.
typedef std::vector<Slice> SliceVector;
virtual std::vector<bool> Filter(int level,
const SliceVector& keys,
const SliceVector& existing_values,
std::vector<std::string>* new_values,
std::vector<bool>* values_changed
) const = 0;
Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.
Reviewers: haobo
CCs: leveldb
Differential Revision: https://reviews.facebook.net/D15087
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
includes necessary changes such as updating the pure C api for
partial merge.
Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
operands.
TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.
Reviewers: haobo, igor, vamsi
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16815
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct
Some of the code was stolen from https://reviews.facebook.net/D16935
Test Plan: added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17097
Summary: Added a function/command to check the consistency of live files' meta data
Test Plan:
Manual test (size mismatch, file not exist).
Command test script.
Reviewers: haobo
Reviewed By: haobo
CC: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D16935
Summary:
Whenever we get an IOError from GetImpl() or NewIterator(), we should immediatelly mark the DB read-only. The same check already exists in Write() and Compaction().
This should help with clients that are somehow missing a file.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17061
Summary: D17067 breaks DBTest.UniversalCompactionTrigger because of wrong location of the checking. Fix it.
Test Plan: Run the test and make sure it passes.
Reviewers: igor, haobo
Reviewed By: igor
CC: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17079
Summary: Add unit tests to make sure CompactionFilterContext::is_manual_compaction_ and CompactionFilterContext::is_full_compaction_ are set correctly.
Test Plan: run the new tests.
Reviewers: haobo, igor, dhruba, yhchiang, ljin
Reviewed By: haobo
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17067
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.
The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console
The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...
"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.
I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.
Test Plan: make check for now, will run db_stress in jenkins
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17037
Summary:
We have an issue with internal service trying to run compaction with zero input files:
2014/02/07-02:26:58.386531 7f79117ec700 Compaction start summary: Base version 1420 Base level 3, seek compaction:0, inputs:[Ï›~^Qy^?],[]
2014/02/07-02:26:58.386539 7f79117ec700 Compacted 0@3 + 0@4 files => 0 bytes
There are two issues:
* inputsummary is printing out junk
* it's constantly retrying (since I guess madeProgress is true), so it prints out a lot of data in the LOG file (40GB in one day).
I read through the Level compaction picker and added some failure condition if input[0] is empty. I think PickCompaction() should not return compaction with zero input files with this change. I'm not confident enough to add an assertion though :)
Test Plan: make check
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16005
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!
Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s
Reviewers: haobo, igor, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17007
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring
Test Plan: Add a unit test. make all check
Reviewers: haobo, igor, dhruba
Reviewed By: igor
CC: ljin, nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16959
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.
Test Plan: run unit tests and add a new unit test to cover new properties.
Reviewers: haobo, dhruba
Reviewed By: dhruba
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D13677
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.
Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console
Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.
Short-term, I removed Finalize() from CompactionPicker.
Note: I believe this is an issue in current 2.7 version running in production.
Test Plan:
make check
Will also run db_stress to see if issue is gone
Reviewers: sdong, ljin, dhruba, haobo
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16983
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.
Test Plan: make
Reviewers: haobo, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16923
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.
This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.
At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.
Test Plan: make check
Reviewers: dhruba, sdong, haobo, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16761
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked
If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)
I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?
Test Plan: make check
Reviewers: ljin, dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16929
Summary:
Memtable will now be forced to flush if the one of the following
conditions is met:
1. Already allocated more than write_buffer_size + 60% arena block size.
(the overflowing condition)
2. Unable to safely allocate one more arena block without hitting the
overflowing condition AND the unused allocated memory < 25% arena
block size.
Test Plan: make all check
Reviewers: sdong, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16893
Summary: Prev() now can hang when there is a key with more than max_skipped number of appearance internally but all of them are newer than the sequence ID to seek. Add unit tests to confirm the bug and fix it.
Test Plan: make all check
Reviewers: igor, haobo
Reviewed By: igor
CC: ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16899
Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.
Test Plan: still running make whitebox_crash_test
Reviewers: igor, haobo, igor2
Reviewed By: igor2
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16911
Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.
Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.
Reviewers: igor, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16875
Summary: Client doesn't need to know anything about ColumnFamily ID. By making WriteBatch take ColumnFamilyHandle as a parameter, we can eliminate method GetID() from ColumnFamilyHandle
Test Plan: column_family_test
Reviewers: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16887
Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.
Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.
Test Plan: SIGSEGV and assertion-throwing test now works!
Reviewers: ljin, haobo, sdong
Reviewed By: sdong
CC: leveldb, ljin
Differential Revision: https://reviews.facebook.net/D16857
Summary:
With D16767, there is a case compaction tasks are scheduled infinitely:
(1) no flush thread is configured and more than 1 compaction threads
(2) a flush is going on by one compaction hread
(3) the state of SST files is in the state that versions_->current()->NeedsCompaction() will generate a false positive (return true actually there is no work to be done)
In that case, a infinite loop will be formed.
This patch would fix it.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: igor
CC: dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16863
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.
Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"
Test Plan: N/A
Reviewers: haobo, sdong
Reviewed By: sdong
CC: leveldb, sdong
Differential Revision: https://reviews.facebook.net/D15051
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.
Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.
This diff also fixes race condition when dropping column family.
Test Plan: Ran db_stress with variety of parameters
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D16833