Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary:
Use FileLevel in LevelFileNumIterator, thus use new version of findFile.
Old version of findFile function is deleted.
Write a function in version_set.cc to generate FileLevel from files_.
Add GenerateFileLevelTest in version_set_test.cc
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D19659
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
Summary:
This patch include two fixes:
1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version.
2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size.
Test Plan:
export ROCKSDB_TESTS=CompactionDele
./db_test
Reviewers: ljin, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19557
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.
Test Plan: make all check
Reviewers: igor, haobo, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18921
Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_
Test Plan:
1) make check with the change
2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker
Reviewers: dhruba, haobo, yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19125
Summary:
This diff allows compaction to reclaim storage more effectively.
In the current design, compactions are mainly triggered based on
the file sizes. However, since deletion entries does not have
value, files which have many deletion entries are less likely
to be compacted. As a result, it may took a while to make
deletion entries to be compacted.
This diff address issue by compensating the size of deletion
entries during compaction process: the size of each deletion
entry in the compaction process is augmented by 2x average
value size. The diff applies to both leveled and universal
compacitons.
Test Plan:
develop CompactionDeletionTrigger
make db_test
./db_test
Reviewers: haobo, igor, ljin, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19029
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary:
We decided that one of the long term goals is to unify level and universal compaction.
As a small first step, I'm unifying level 0 sorting methods.
Previously, we used to sort level 0 files in level compaction by file number and in universal compaction by sequence number.
But it turns out that in level compaction, sorting by file number is exactly the same as sorting by sequence number.
Test Plan:
Ran make check with bunch of asserts to verify the sorting order is exactly the same.
Also, make check with this patch
Reviewers: haobo, yhchiang, ljin, dhruba, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19131
Summary:
We added multiple fields to FileMetaData recently and are planning to add more.
This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements:
(1) use it to design a more efficient data structure to speed up read queries.
(2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data.
The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba, yhchiang
Differential Revision: https://reviews.facebook.net/D18933
Summary:
https://reviews.facebook.net/D17205 removed the logic of skipping file key range check when there are less than 3 level 0 files. This patch brings it back.
Other than that, add another small optimization to avoid to check all the levels if most higher levels don't have any file.
Test Plan: make all check
Reviewers: ljin
Reviewed By: ljin
Subscribers: yhchiang, igor, haobo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19035
Summary: as title
Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs
Reviewers: dhruba, haobo, sdong, igor
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18867
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary: Cleaned up compaction logging a little bit. Now file sizes are easier to read. Also, removed the trailing space.
Test Plan:
verified that i'm happy with logging output:
files_size[#33(seq=101,sz=98KB,0) #31(seq=81,sz=159KB,0) #26(seq=0,sz=637KB,0)]
Reviewers: sdong, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18549
Summary: One of our users reported current file corruption. The machine was rebooted during the time. This is the only think I can think of which could cause current file corruption. Just add this paranoid check.
Test Plan: make all check
Reviewers: haobo, igor
Reviewed By: haobo
CC: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18495
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17853
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary:
Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message"
Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs).
Test Plan: make check + ran db_bench to confirm I'm happy with log output
Reviewers: dhruba, haobo, ljin, yhchiang, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18303
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.
(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.
Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17205
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.
Test Plan: make all check
Reviewers: ljin, igor, haobo
Reviewed By: ljin
CC: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17991
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, yhchiang, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17739
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.
Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)
Test Plan: compiles :)
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17835
Summary:
With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).
After the diff, we won't create new log file if current is empty.
Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)
Test Plan: Added an unit test. Confirmed that unit test failes in master
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17631
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)
Test Plan: added a unit test in column_family_test
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17565
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead
Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check
Reviewers: igor, haobo, ljin
Reviewed By: igor
CC: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17409
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.
Test Plan: make check
Reviewers: haobo, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17343
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).
In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.
I also added a verification that all files are sorted correctly in CheckConsistency()
NOTE: This diff depends on D17037
Test Plan: make check. Will also run db_stress
Reviewers: dhruba, haobo, sdong, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17049
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it
Test Plan: make check
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17163
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.
Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.
Reviewers: haobo, igor, dhruba, ljin, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17001
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
includes necessary changes such as updating the pure C api for
partial merge.
Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
operands.
TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.
Reviewers: haobo, igor, vamsi
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16815
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.
The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console
The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...
"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.
I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.
Test Plan: make check for now, will run db_stress in jenkins
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17037
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!
Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s
Reviewers: haobo, igor, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17007
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.
Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console
Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.
Short-term, I removed Finalize() from CompactionPicker.
Note: I believe this is an issue in current 2.7 version running in production.
Test Plan:
make check
Will also run db_stress to see if issue is gone
Reviewers: sdong, ljin, dhruba, haobo
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16983
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.
Test Plan: make
Reviewers: haobo, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16923
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.
This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.
At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.
Test Plan: make check
Reviewers: dhruba, sdong, haobo, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16761
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked
If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)
I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?
Test Plan: make check
Reviewers: ljin, dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16929
Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.
Test Plan: still running make whitebox_crash_test
Reviewers: igor, haobo, igor2
Reviewed By: igor2
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16911
Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.
Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.
Reviewers: igor, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16875
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.
Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.
This diff also fixes race condition when dropping column family.
Test Plan: Ran db_stress with variety of parameters
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D16833
Summary: Column family should be dropped after the change has been commited
Test Plan: db stress
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16779
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.
Test Plan: corruption_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16695
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
Test Plan:
make all check
check the log lines while running some tests that trigger compactions.
Reviewers: haobo, igor, dhruba
Reviewed By: dhruba
CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D16515
Summary:
Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest.
Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine.
Test Plan: added a test to column_family_test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16581
Summary:
EncodeTo(&record) does not overwrite, it appends to it.
This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3
I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.
This bug turned up in column family work, where opening a database failed with "adding a same column family twice".
Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16461
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.
Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.
Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.
Test Plan: ran ./table_test
Reviewers: sdong
Reviewed By: sdong
CC: sdong, leveldb
Differential Revision: https://reviews.facebook.net/D16503
Summary:
I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case.
This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation
(1) and (2) don't support group commit yet.
There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure).
Test Plan: db_stress
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16491
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files
(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16119
Summary:
This fixes few bugs with CreateColumnFamily
* We first have to LogAndApply and then call VersionSet::CreateColumnFamily. Otherwise, WriteSnapshot might be invoked, writing out column family add inside of LogAndApply, even though it's not really committed
* Fix LogAndApplyHelper() to not apply log number to column_family_data, which is in case of column family add, just a dummy (default) column family
* Create SuperVerion when creating column family
Test Plan: column_family_test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16443
Summary: Added list_column_family command and also updated dump_manifest
Test Plan: no
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16419
Summary: New unit tests for column families
Test Plan: this is a test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16359
Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16347
Summary: Adapt table properties to column family world
Test Plan: make check
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16161
Summary: Provide a public API for users to access the table properties for each SSTable.
Test Plan: Added a unit tests to test the function correctness under differnet conditions.
Reviewers: haobo, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16083
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)
Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up
Test Plan: added some tests to column_family_test
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16101
Summary: This covers existing table files before DB open happens and avoids contention on table cache
Test Plan: db_test
Reviewers: haobo, sdong, igor, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16089
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15933
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least.
Test Plan: make check
Reviewers: dhruba, kailiu, haobo, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15879
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)
I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.
Test Plan: make check doesn't complain
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15873
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15489
Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange()
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15813
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15801
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!
Test Plan: Added a new unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15789
Summary: The default settings enable checksum verification on every read.
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15591
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.
Test Plan: make check does not complain
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15597
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15579
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()
Test Plan: Added a unit test in column_family_test
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15537
Summary:
I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry.
@kailiu, can we get this in release, too?
Test Plan: make check
Reviewers: kailiu, dhruba, haobo
Reviewed By: haobo
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D15513
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.
Test Plan: make check
Reviewers: dhruba, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15465
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15459
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.
Should I also add FsyncDir() to new files that get created by a compaction?
Test Plan: Confirmed that FsyncDir is returning Status::OK()
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D14751
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.
Test Plan: make check
Reviewers: kailiu, haobo, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15333
ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we
didn't change the number of levels in VersionSet (we consider them
immutable from now on). This fixes the problem.
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.
Let me know if you have any comments. I will commit tomorrow.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15357
Summary:
In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends.
I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :)
Test Plan: make check
Reviewers: haobo, kailiu, sdong, dhruba
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15267
Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15225
Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.
To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.
In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.
This diff depends on D15171, D15183, D15189 and D15201
Test Plan: `make check`
Reviewers: kailiu, sdong, dhruba, haobo
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15207
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.
According to igor's previous work, we are likely to make quite some performance gain from this diff.
Test Plan: make check
Reviewers: dhruba, igor, sdong, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15213
Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.
The diff depends on D15171 and D15183
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15189
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.
The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files
The diff continues on (and depends on) D15171
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong, emayanke
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15183
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()
This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.
I have also slightly changed how VersionSet keeps track of manifest size.
This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.
In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_
Test Plan: make check
Reviewers: haobo, dhruba, kailiu, sdong
Reviewed By: sdong
Differential Revision: https://reviews.facebook.net/D15171
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.
This patch fixed the unit test.
Test Plan: Added a failing unit test and a fix, so it's not failing anymore.
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D14421
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.
Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.
This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15159
Summary: Currently in DBImpl::MakeRoomForWrite(), we do "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.
Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.
Reviewers: haobo, kailiu, igor
Reviewed By: kailiu
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D15141
Summary: We don't want to delete ColumnFamilyData object if somebody has references to it.
Test Plan: `make check` for now, but will need to implement bigger column family test case
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15111
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.
I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15105
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().
We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.
Test Plan: add a test case in db_test
Reviewers: haobo, kailiu
Reviewed By: haobo
CC: dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D15039
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).
This endeavour required changing some WriteBatch internals.
Test Plan: Added a unittest
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15045
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()
I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.
Test Plan: Added a simple unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15033
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733
It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.
Test Plan: Added unit test
Reviewers: dhruba, haobo, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15021
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.
Test Plan: ran `make check`
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15009
Summary:
<This diff is for Column Family branch>
Added fields in manifest file to support adding and deleting column families.
Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14733
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.
The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14691
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.
Test Plan: make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14607
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14595
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Conflicts:
db/db_impl.cc
db/memtable.cc
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Summary:
This code path can potentially accumulate multiple important_files for level 0.
But for other levels, it should have only one file in the
important_files, so it is ok not to reserve excessive space, is it not?
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14349