Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.
The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.
Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
Reviewers: haobo, emayanke, xjin
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12825
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().
Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12531
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.
Test Plan: ./filter_block_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12501
Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff).
Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D12273
Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures.
Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api.
I have used std::unique_ptr - will switch to boost:: since this is external. thoughts?
Unit test is fragile right now as it expects the compaction at certain levels.
Test Plan: unittest
Reviewers: dhruba, vamsi, emayanke
CC: zshao, leveldb, haobo
Task ID: #
Blame Rev:
Summary:
The pre-existing code was purging a DeleteMarker if thay key did not
exist in deeper levels. But in the Universal Compaction Style, all
files are in Level0. For compaction runs that did not include the
earliest file, we were erroneously purging the DeleteMarkers.
The fix is to purge DeleteMarkers only if the compaction includes
the earlist file.
Test Plan: DBTest.Randomized triggers this code path.
Differential Revision: https://reviews.facebook.net/D12081
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.
I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.
This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.
NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!
Test Plan:
-This is a preliminary diff. I have only just begun testing/debugging it.
-I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
-I will be "desk-checking" and walking through the code with the help gdb.
-I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
-I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
-A lot of feedback from the reviewers.
Reviewers: haobo, dhruba, zshao, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11499
Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test
Test Plan: make all check;db_stress for 1 hour
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11853
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
Test Plan: db_stress;make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11745
Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility.
Test Plan: make check;
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11553
Summary:
The maxGrandParentOverlapBytes_ was signed which was causing
an erroneous comparision between signed and unsigned longs.
This, in turn, was causing compaction-created-output-files
to be very small in size.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D11727
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time
Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved
Reviewers: dhruba, haobo, vamsi
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11607
Summary:
Rename PickCompactionHybrid to PickCompactionUniversal.
Changed a few LOG message from "Hybrid:" to "Universal:".
Test Plan:
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions. Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().
All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted. Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.
The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.
The above logic should reduce write amplification to a
large extent... will publish numbers shortly.
Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
Differential Revision: https://reviews.facebook.net/D11289
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11175
Summary: Added the 'score' column to the compaction stats output, which shows the level total size devided by level target size. Could be useful when monitoring compaction decisions...
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D11025
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h
Please go over local/my_test.cc carefully, as it is a concerete use case.
Please also review the impelmentation files to see if the straw man implementation makes sense.
Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.
Future work:
- Integration with compaction
- A raw Get implementation
I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.
Test Plan: run all local tests
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9651
Summary: Simplified level_ptrs by using a std:vector
Test Plan: make check
Reviewers: sheki, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10245
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
You also don't know how many dynamic allocations occur just for building up this tree.
switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang, MarkCallaghan
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10197
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.
The default setting remains the same (and is backward compatible):
1. use bufferedio
2. do not use mmaps for reads
3. use mmap for writes
4. use readaheads for reads needed for compaction
I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.
Test Plan: make check
Reviewers: sheki, heyongqiang, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9429
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base
Test Plan: make
Reviewers: chip, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9531
Summary:
SizeBeingCompacted was called without any lock protection. This causes
crashes, especially when running db_bench with value_size=128K.
The fix is to compute SizeUnderCompaction while holding the mutex and
passing in these values into the call to Finalize.
(gdb) where
#4 leveldb::VersionSet::SizeBeingCompacted (this=this@entry=0x7f0b490931c0, level=level@entry=4) at db/version_set.cc:1827
#5 0x000000000043a3c8 in leveldb::VersionSet::Finalize (this=this@entry=0x7f0b490931c0, v=v@entry=0x7f0b3b86b480) at db/version_set.cc:1420
#6 0x00000000004418d1 in leveldb::VersionSet::LogAndApply (this=0x7f0b490931c0, edit=0x7f0b3dc8c200, mu=0x7f0b490835b0, new_descriptor_log=<optimized out>) at db/version_set.cc:1016
#7 0x00000000004222b2 in leveldb::DBImpl::InstallCompactionResults (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1473
#8 0x0000000000426027 in leveldb::DBImpl::DoCompactionWork (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1757
#9 0x0000000000426690 in leveldb::DBImpl::BackgroundCompaction (this=this@entry=0x7f0b49083400, madeProgress=madeProgress@entry=0x7f0b41bf2d1e, deletion_state=...) at db/db_impl.cc:1268
#10 0x0000000000428f42 in leveldb::DBImpl::BackgroundCall (this=0x7f0b49083400) at db/db_impl.cc:1170
#11 0x000000000045348e in BGThread (this=0x7f0b49023100) at util/env_posix.cc:941
#12 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x7f0b49023100) at util/env_posix.cc:874
#13 0x00007f0b4a7cf10d in start_thread (arg=0x7f0b41bf3700) at pthread_create.c:301
#14 0x00007f0b49b4b11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Test Plan:
make check
I am running db_bench with a value size of 128K to see if the segfault is fixed.
Reviewers: MarkCallaghan, sheki, emayanke
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9279
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.
From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
because compaction for compressed levels is much
slower. When Lx is uncompressed and Lx+1 is compressed
then files pile up at Lx because the (Lx,Lx+1)->Lx+1
compaction process is the first to be slowed by
compression.
* without compression -> at level 1
Task ID: #1832108
Blame Rev:
Test Plan:
run with real data, added test
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9045
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space. xfs in our production environment pre-allocates by
1GB, for instance. By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wasteage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).
Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress. LOG file without this change was 4M,
and with it it was 128k then grew to normal size.
Reviewers: dhruba
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D7953
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.
Test Plan: db_stress, make check
Reviewers: dhruba
Reviewed By: dhruba
CC: zshao, leveldb, heyongqiang
Differential Revision: https://reviews.facebook.net/D8043
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?
Test Plan: make all check. Inputs on a new unit test?
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7701
Summary: Due to how the code handled compactions in Level 0 in `PickCompaction()` it could be the case that two compactions on level 0 ran that produced tables in level 1 that overlap. However, this case seems like it would only occur on a seek compaction which is unlikely on level 0. Furthermore, level 0 and level 1 had to have a certain arrangement of files.
Test Plan:
make check
Reviewers: dhruba, vamsi
Reviewed By: dhruba
CC: leveldb, sheki
Differential Revision: https://reviews.facebook.net/D7923
Summary:
Pretty much a blind copy of the patch in open source.
Hope to get this in before we make a release
Test Plan: make clean check
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7809
Summary:
A compaction is picked based on its score. It is useful to
print the compaction score in the LOG because it aids in
debugging. If one looks at the logs, one can find out why
a compaction was preferred over another.
Test Plan: make clean check
Differential Revision: https://reviews.facebook.net/D7137
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.
Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.
Test Plan: make all check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7059
Summary:
When we expand the range of keys for a level 0 compaction, we
need to invoke ParentFilesInCompaction() only once for the
entire range of keys that is being compacted. We were invoking
it for each file that was being compacted, but this triggers
an assertion because each file's range were contiguous but
non-overlapping.
I renamed ParentFilesInCompaction to ParentRangeInCompaction
to adequately represent that it is the range-of-keys and
not individual files that we compact in a single compaction run.
Here is the assertion that is fixed by this patch.
db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData*, std::allocator<leveldb::FileMetaData*> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed.
Test Plan: make clean check OPT=-g
Reviewers: sheki
Reviewed By: sheki
CC: MarkCallaghan, emayanke, leveldb
Differential Revision: https://reviews.facebook.net/D6963
Summary:
The manifest file contains a series of edits. If the verbose
option is switched on, then print each individual edit in the
manifest file. This helps in debugging.
Test Plan: make clean manifest_dump
Reviewers: emayanke, sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6807
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.
Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.
At steady state, I have seen that this patch increase
throughout from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.
This is mostly an early preview and not ready for rigorous review.
With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.
Test Plan: make check
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D6411
Summary:
The Version::GetOverlappingInputs() is called multiple times in
the compaction code path. Eack invocation does a binary search
for overlapping files in the specified key range.
This patch remembers the offset of an overlapped file when
GetOverlappingInputs() is called the first time within
a compaction run. Suceeding calls to GetOverlappingInputs()
uses the remembered index to avoid the binary search.
I measured that 1000 iterations of GetOverlappingInputs
takes around 4500 microseconds without this patch. If I use
this patch with the hint on every invocation, then 1000
iterations take about 3900 microsecond.
Test Plan: make check OPT=-g
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6513
Summary:
The method Version::GetOverlappingInputs used a sequential search
to map a kay-range to a set of files. But the files are arranged
in ascending order of key, so a biary search is more effective.
This patch implements Version::GetOverlappingInputsBinarySearch
that finds one file that corresponds to the specified key range
and then iterates backwards and forwards to find all overlapping
files.
This patch is critical for making compactions efficient, especially
when there are thousands of files in a single level.
I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes
takes 16000 microseconds without this patch. With this patch, the
same method takes about 4600 microseconds.
Test Plan: Almost all unit tests in db_test uses this method to lookup keys.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6465
Summary: as subject.
Test Plan: manually test it, will add a testcase
Reviewers: dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6345
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6243
Summary:
It is best if we pick the largest file to compact in a level.
This reduces the write amplification factor for compactions.
Each level has an auxiliary data structure called files_by_size_
that sorts all files by their size. This data structure is
updated when a new version is created.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6195