Fix for:
[db/db_test.cc:6141]: (performance) Function parameter
'key' should be passed by reference.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Summary:
Add the MultiGet API to allow prefetching.
With file size of 1.5G, I configured it to have 0.9 hash ratio that can
fill With 115M keys and result in 2 hash functions, the lookup QPS is
~4.9M/s vs. 3M/s for Get().
It is tricky to set the parameters right. Since files size is determined
by power-of-two factor, that means # of keys is fixed in each file. With
big file size (thus smaller # of files), we will have more chance to
waste lot of space in the last file - lower space utilization as a
result. Using smaller file size can improve the situation, but that
harms lookup speed.
Test Plan: db_bench
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23673
Summary:
Add a CompactedDBImpl that will enabled when calling OpenForReadOnly()
and the DB only has one level (>0) of files. As a performan comparison,
CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with
ReadOnlyDBImpl.
Test Plan: db_bench
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23553
Summary: Those were introduced with 2fb1fea30f because the flushing behavior changed when max_background_flushes is > 0.
Test Plan: make check
Reviewers: ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23577
Summary: We currently don't test mmap reads as part of db_test. Piggyback it on kWalDir test config.
Test Plan: make check
Reviewers: ljin, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23337
Summary:
1. wrap a filter policy like what fbcode/multifeed/rocksdb/MultifeedRocksDbKey.h
to ensure that rocksdb works fine after filterpolicy interface change
Test Plan: 1. valgrind ./bloom_test
Reviewers: ljin, igor, yhchiang, dhruba, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23229
Summary: Get valgrind to stop complaining about uninitialized value
Test Plan: valgrind not complaining anymore
Reviewers: sdong, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23289
Summary: The test makes sure that we don't call flush too often. For that, it's ok to check if we have less than 10 table files. Otherwise, the test is flaky because it's hard to estimate number of entries in the memtable before it gets flushed (any ideas?)
Test Plan: Still works, but hopefully less flaky.
Reviewers: ljin, sdong, yhchiang
Reviewed by: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23241
Summary:
When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path.
There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this.
Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while.
Reviewers: yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23067
Summary:
See t5106397.
Also, few more changes:
1. in unit tests, the assumption is that writes will be dropped when there is no space left on device. I changed the wording around it.
2. InvalidArgument() errors are only when user-provided arguments are invalid. When the file is corrupted, we need to return Status::Corruption
Test Plan: make check
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23145
Summary:
In column family's SanitizeOptions() [1], we make sure that min_write_buffer_number_to_merge is normal value. However, this test depended on the fact that setting min_write_buffer_number_to_merge to be bigger than max_write_buffer_number will cause a deadlock. I'm not sure how it worked before.
This diff fixes it by scheduling sleeping background task, which will actually block any attempts of flushing.
[1] https://github.com/facebook/rocksdb/blob/master/db/column_family.cc#L104
Test Plan: the test works now
Reviewers: yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23103
Summary: Avoid creating unnecessary sst files while db opening
Test Plan: make all check
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: zagfox, yhchiang, ljin, leveldb
Differential Revision: https://reviews.facebook.net/D20661
Summary: ...
Test Plan: Can't repro the test failure, but let's see what jenkins says
Reviewers: zagfox, sdong, ljin
Reviewed By: sdong, ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23061
Summary:
When we have multiple column families, users can issue Flush() on every column families to make sure everything is flushes, even if some of them might be empty. By skipping the waiting for empty cases, it can be greatly speed up.
Still wait for people's comments before writing unit tests for it.
Test Plan: Will write a unit test to make sure it is correct.
Reviewers: ljin, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D22953
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.
2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.
3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.
4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.
5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc
Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.
Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh
Reviewers: igor, yhchiang, ljin, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20979
Summary: If we drop column family only from (single) write thread, we can be sure that nobody will drop the column family while we're writing (and our mutex is released). This greatly simplifies my patch that's getting rid of MakeRoomForWrite().
Test Plan: make check, but also running stress test
Reviewers: ljin, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22965
Summary: fixed memory leak in unit test DBIteratorBoundTest
Test Plan: ran valgrind test on my unit test
Reviewers: sdong
Differential Revision: https://reviews.facebook.net/D22911
Summary:
Simply code by removing code path which does not use Arena
from NewInternalIterator
Test Plan:
make all check
make valgrind_check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22395
Summary: In DBImpl::Recover method, while loading memtables, also check if memtables are empty. Use this in DBImplReadonly to determine whether to lookup memtable or not.
Test Plan:
db_test
make check all
Reviewers: sdong, yhchiang, ljin, igor
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22281
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21915
Summary:
Currently, PlainTable must use mmap_reads. When PlainTable is used but
allow_mmap_reads is not set, rocksdb will fail in flush.
This diff improve Options sanitization and add MmapReadRequired() to
TableFactory.
Test Plan:
export ROCKSDB_TESTS=PlainTableOptionsSanitizeTest
make db_test -j32
./db_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: you, leveldb
Differential Revision: https://reviews.facebook.net/D21939
Summary: A previous change triggered a change by mistake: DestroyDB() will keep info logs under DB directory. Revert the unintended change.
Test Plan: Add a unit test case to verify it.
Reviewers: ljin, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22209
Summary:
Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write.
WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset.
Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries.
I will add more unit tests if people are OK about this API.
Test Plan:
make all check
Add unit tests.
Reviewers: yhchiang, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb, xjin
Differential Revision: https://reviews.facebook.net/D21381
Summary:
1. Support purging info logs from a separate paths from DB path. Refactor the codes of generating info log prefixes so that it can be called when generating new files and scanning log directory.
2. Fix the bug of not scanning multiple DB paths (should only impact multiple DB paths)
Test Plan:
Add unit test for generating and parsing info log files
Add end-to-end test in db_test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb, igor, dhruba
Differential Revision: https://reviews.facebook.net/D21801
Summary:
Currently DBImpl::Flush() triggers flushes in all column families.
Instead we need to trigger just the column family specified.
Test Plan: make all check
Reviewers: igor, ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20841
Summary: Fix old issue with DBTest.Randomized with BlockBasedTableWithWholeKeyHashIndex + added printing in DBTest.Randomized.
Test Plan: make all check
Reviewers: zagfox, igor, ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21003
Summary:
Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache.
Refactor the property codes to allow getting property from a version, with DB mutex not acquired.
Test Plan: Add several checks of this new property in existing codes for various cases.
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, igor, leveldb
Differential Revision: https://reviews.facebook.net/D20733
Summary:
Fixed the following compilation error detected in mac:
db/db_test.cc:2524:3: note: in instantiation of function template
specialization 'rocksdb::test::Tester::IsEq<unsigned long long, int>' requested here
ASSERT_EQ(int_num, 0);
^
Test Plan:
make
Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string
Test Plan: Add several unit test checks
Reviewers: yhchiang, igor, dhruba, haobo, ljin
Reviewed By: ljin
Subscribers: yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D20637
Summary: Add a DB property of estimated number of live keys, by adding number of entries of all mem tables and all files, subtracted by all deletions in all files.
Test Plan: Add the case in unit tests
Reviewers: hobbymanyp, ljin
Reviewed By: ljin
Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba
Differential Revision: https://reviews.facebook.net/D20631
Summary:
User gets undefinied error since the definition is not exposed.
Also re-enable the db test with only upper bound check
Test Plan: db_test, rate_limit_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20403
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.
Test Plan: add a unit test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D20085
Summary: Implement Prev() with merge operator for DBIterator. Request from mongoDB. Task 4673663.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19743
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary:
If `NeedToSeekImmutable()` returns false, `SeekInternal()` won't reset the
contents of `immutable_min_heap_`. However, since it calls `UpdateCurrent()`
unconditionally, if `current_` is one of immutable iterators (previously popped
from `immutable_min_heap_`), `UpdateCurrent()` will overwrite it. As a result,
if old `current_` in fact pointed to the smallest entry, forward iterator will
skip some records.
Fix implemented in this diff pushes `current_` back to `immutable_min_heap_`
before calling `UpdateCurrent()`.
Test Plan:
New unit test (courtesy of @lovro):
$ ROCKSDB_TESTS=TailingIteratorSeekToSame ./db_test
Reviewers: igor, dhruba, haobo, ljin
Reviewed By: ljin
Subscribers: lovro, leveldb
Differential Revision: https://reviews.facebook.net/D19653
Summary:
Forward iterator only checked `status_` and `mutable_iter_->status()`, which is
not sufficient. For example, when reading exclusively from cache
(kBlockCacheTier), `mutable_iter_->status()` may return kOk (e.g. there's
nothing in the memtable), but one of immutable iterators could be in
kIncomplete. In this case, `ForwardIterator::status()` ought to return that
status instead of kOk.
This diff changes `status()` to also check `imm_iters_`, `l0_iters_`, and
`level_iters_`.
Test Plan:
ROCKSDB_TESTS=TailingIteratorIncomplete ./db_test
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19581
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.
Test Plan: Add a verification in db_test
Reviewers: yhchiang, igor, ljin
Reviewed By: ljin
Subscribers: xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19551
Summary:
The test is not stable because it relies on disk and only runs for a
short period of time. So misisng a compaction/flush would greatly affect
the rate. I am disabling it for now. What do you guys think?
Test Plan: make
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19599
Summary:
Randomize keys so that compaction actually happens.
Change the config so that compaction happens more aggressively.
The test takes longer time, but the results are more stable shown by
iostat
Test Plan: ran it
Reviewers: igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19533
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
Test Plan: db_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19425
Summary:
SimpleWriteTimeoutTest has two parts: 1) insert two large key/values
to make memtable full and expect both of them are successful; 2) insert
another key / value and expect it to be timed-out. Previously we also
set a timeout in the first step, but this might sometimes cause
false alarm.
This diff makes the first two writes run without timeout setting.
Test Plan:
export ROCKSDB_TESTS=Time
make db_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19461
Summary:
This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
waiting for longer than the specified time. If an abortion happens,
associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
will be limited by timeout_hint_us.
The default value of timeout_hint_us is 0 (i.e., OFF.)
The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test
Reviewers: igor, ljin, haobo, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18837
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.
Test Plan: make all check
Reviewers: igor, haobo, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18921
Commit 6634844dba by sdong
Two small fixes in db_test
Summary:
Two fixes:
(1) WalDir to pick a directory under TmpDir to allow two tests running in parallel without impacting each other
(2) kBlockBasedTableWithWholeKeyHashIndex is disabled by mistake (I assume). Enable it.
Test Plan: ./db_test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: nkg-, igor, dhruba, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D19389
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.
Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.
The new data structure is documented in comments in the codes.
Test Plan:
make all check
set threshold_use_skiplist in several tests
Reviewers: yhchiang, haobo, ljin
Reviewed By: yhchiang, ljin
Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19299
Summary:
Two fixes:
(1) WalDir to pick a directory under TmpDir to allow two tests running in parallel without impacting each other
(2) kBlockBasedTableWithWholeKeyHashIndex is disabled by mistake (I assume). Enable it.
Test Plan: ./db_test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: nkg-, igor, dhruba, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D19389
Summary:
Bloomfilter and hashskiplist's buckets_ allocated by memtable's arena
DynamicBloom: pass arena via constructor, allocate space in SetTotalBits
HashSkipListRep: allocate space of buckets_ using arena.
do not delete it in deconstructor because arena would take care of it.
Several test files are changed.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D19335
Summary:
This diff allows compaction to reclaim storage more effectively.
In the current design, compactions are mainly triggered based on
the file sizes. However, since deletion entries does not have
value, files which have many deletion entries are less likely
to be compacted. As a result, it may took a while to make
deletion entries to be compacted.
This diff address issue by compensating the size of deletion
entries during compaction process: the size of each deletion
entry in the compaction process is augmented by 2x average
value size. The diff applies to both leveled and universal
compacitons.
Test Plan:
develop CompactionDeletionTrigger
make db_test
./db_test
Reviewers: haobo, igor, ljin, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19029
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index.
Test Plan: unit test, also manually veried LOG that fallback did occur.
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19191
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary:
In order to use arena to a use case that the total allocation size might be small (LogBuffer is already such a case), inline 1KB of data in it, so that it can be mostly in stack or inline in another class.
If always inlining 2KB is a concern, I could make it a template to determine what to inline. However, dependents need to changes. Doesn't go with it for now
Test Plan: make all check.
Reviewers: haobo, igor, yhchiang, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18609
Summary:
Fixed a file-not-found issue when a log file is moved to archive
by doing a missing retry.
Test Plan:
make db_test
export ROCKSDB_TEST=TransactionLogIteratorRace
./db_test
Reviewers: sdong, haobo
Reviewed By: sdong
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D18669
Summary: We had a hypothesis in https://reviews.facebook.net/D18507 that empty-string internal keys might have been caused by compaction filter deleting all the entries. I added a unit test for that case. Unforutnately, everything works as expected.
Test Plan: this is a test
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18519
Summary:
db_test includes Benchmark for LogAndApply. This diff removes it from db_test and puts it into a separate log_and_apply bench. I just wanted to play around with our new benchmark framework and figure out how it works.
I would also like to show you how great it is! I believe right set of microbenchmarks can speed up our productivity a lot and help catch early regressions.
Test Plan: no
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18261
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: nkg-, dhruba, leveldb, igor, yhchiang
Differential Revision: https://reviews.facebook.net/D18357
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
Cuckoo hash uses multiple hash functions. This allows each key to have multiple
possible locations in the mem-table.
- Put: When insert a key, it will try to find whether one of its possible
locations is vacant and store the key. If none of its possible
locations are available, then it will kick out a victim key and
store at that location. The kicked-out victim key will then be
stored at a vacant space of its possible locations or kick-out
another victim. In this diff, the kick-out path (known as
cuckoo-path) is found using BFS, which guarantees to be the shortest.
- Get: Simply tries all possible locations of a key --- this guarantees
worst-case constant time complexity.
- Time complexity: O(1) for Get, and average O(1) for Put if the
fullness of the mem-table is below 80%.
- Default using two hash functions, the number of hash functions used
by the cuckoo-hash may dynamically increase if it fails to find a
short-enough kick-out path.
- Currently, HashCuckooRep does not support iteration and snapshots,
as our current main purpose of this is to optimize point access.
= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
supports snapshots. If it returns false, then DB::GetSnapshot() will
always return nullptr.
Test Plan:
Run existing tests. Will develop a test specifically for cuckoo hash in
the next diff.
Reviewers: sdong, haobo
Reviewed By: sdong
CC: leveldb, dhruba, igor
Differential Revision: https://reviews.facebook.net/D16155
Summary:
ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().
I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.
Test Plan: make check
Reviewers: haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18387
Summary:
When TransactionLogIterator comes to EOF, it calls UnmarkEOF and continues reading. However, if glibc cached the EOF status of the file, it will get EOF again, even though the new data might have been written to it.
This has been causing errors in Mac OS.
Test Plan: test passes, was failing before
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18381
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17853
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary: In this patch, two new DB properties are defined: rocksdb.num-immutable-mem-table and rocksdb.num-entries-imm-mem-tables, from where number of entries in mem tables can be exposed to users
Test Plan:
Cover the codes in db_test
make all check
Reviewers: haobo, ljin, igor
Reviewed By: igor
CC: nkg-, igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18207
Summary:
Using ThreadLocalPtr as a flag to determine if a mutex is locked or not enables us to implement AssertNotHeld(). It also makes AssertHeld() actually correct.
I had to remove port::Mutex as a dependency for util/thread_local.h, but that's fine since we can just use std::mutex :)
Test Plan: make check
Reviewers: ljin, dhruba, haobo, sdong, yhchiang
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18171
Summary: Current behavior of creating new DB is, if there is existing log files, we will go ahead and replay them on top of empty DB. This is a behavior that no user would expect. With this patch, we will fail the creation if a user creates a DB with existing log files.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: haobo
CC: nkg-, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17817
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.
Test Plan: Wrote several new unit tests.
Reviewers: sdong, haobo, dhruba
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16539
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.
Test Plan: compiles
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17649
Summary:
This is testing behavior that was reported in https://github.com/facebook/rocksdb/issues/111
No issue was found, but it still good to commit this and make CompressedCache more robust.
Test Plan: this is a plan
Reviewers: ljin, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17625
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)
Test Plan: added a unit test in column_family_test
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17565
Summary:
GetProperty test is flakey.
Before this diff: P8635927
After: P8635945
We need to make sure the thread is done before we destruct sleeping tasks. Otherwise, bad things happen.
Test Plan: See summary
Reviewers: ljin, sdong, haobo, dhruba
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17595
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.
Test Plan: TransactionLogIteratorRace; make check
Reviewers: dhruba, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17121
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.
Test Plan: make all check
Reviewers: haobo, yhchiang, igor
CC: danguo, dhruba, ljin, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17223
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: dhruba, igor, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17289
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?
Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: dhruba
CC: leveldb, danguo
Differential Revision: https://reviews.facebook.net/D17337
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.
Test Plan: make all check
Reviewers: dhruba, haobo, ljin, igor, yhchiang
Reviewed By: igor
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D16995
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.
Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.
Reviewers: haobo, igor, dhruba, ljin, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17001
Summary:
Currently if client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff added a bool to indicate if the prefix
has been intialized. I also added a unit test to cover this case and
make sure the new code path is hit.
Test Plan: db_test
Reviewers: igor, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17151
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.
typedef std::vector<Slice> SliceVector;
virtual std::vector<bool> Filter(int level,
const SliceVector& keys,
const SliceVector& existing_values,
std::vector<std::string>* new_values,
std::vector<bool>* values_changed
) const = 0;
Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.
Reviewers: haobo
CCs: leveldb
Differential Revision: https://reviews.facebook.net/D15087
Summary: D17067 breaks DBTest.UniversalCompactionTrigger because of wrong location of the checking. Fix it.
Test Plan: Run the test and make sure it passes.
Reviewers: igor, haobo
Reviewed By: igor
CC: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17079
Summary: Add unit tests to make sure CompactionFilterContext::is_manual_compaction_ and CompactionFilterContext::is_full_compaction_ are set correctly.
Test Plan: run the new tests.
Reviewers: haobo, igor, dhruba, yhchiang, ljin
Reviewed By: haobo
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17067
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring
Test Plan: Add a unit test. make all check
Reviewers: haobo, igor, dhruba
Reviewed By: igor
CC: ljin, nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16959
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.
Test Plan: run unit tests and add a new unit test to cover new properties.
Reviewers: haobo, dhruba
Reviewed By: dhruba
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D13677
Summary: Prev() now can hang when there is a key with more than max_skipped number of appearance internally but all of them are newer than the sequence ID to seek. Add unit tests to confirm the bug and fix it.
Test Plan: make all check
Reviewers: igor, haobo
Reviewed By: igor
CC: ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16899
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.
Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"
Test Plan: N/A
Reviewers: haobo, sdong
Reviewed By: sdong
CC: leveldb, sdong
Differential Revision: https://reviews.facebook.net/D15051
Summary: A bad Auto-Merge caused log buffer is flushed twice. Remove the unintended one.
Test Plan: Should already be tested (the code looks the same as when I ran unit tests).
Reviewers: haobo, igor
Reviewed By: haobo
CC: ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16821
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.
Test Plan: db_bench && make asan_check
Reviewers: haobo, igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16587
Summary: I wrote a test that triggers assertion in MergingIterator. I have not touched that code ever, so I'm looking for somebody with good understanding of the MergingIterator code to fix this. The solution is probably a one-liner. Let me know if you're willing to take a look.
Test Plan: This test fails with an assertion `use_heap_ == false`
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16521
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.
Test Plan: Tests are done off-line and will not be included in the regular unit test.
Reviewers: igor
Reviewed By: igor
CC: leveldb, xjin, dhruba
Differential Revision: https://reviews.facebook.net/D16305
Summary:
Two new column family tests:
* DifferentMergeOperators -- three column families, one without merge operator, one with add operator and one with append operator. verify that operations work as expected.
* DifferentCompactionStyles -- three column families, two with level compactions and one with universal compaction. trigger the compactions and verify they work as expected.
Test Plan: nope
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16377
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.
Test Plan: Added test to column_family_test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16323
Summary: Adapt table properties to column family world
Test Plan: make check
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16161
Summary:
This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far.
One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families.
Test Plan: This is a test!
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16149
Summary: Provide a public API for users to access the table properties for each SSTable.
Test Plan: Added a unit tests to test the function correctness under differnet conditions.
Reviewers: haobo, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16083
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)
Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up
Test Plan: added some tests to column_family_test
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16101
Summary: Added a bit more information to compaction context, requested by internal team at FB.
Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16095
Summary: This covers existing table files before DB open happens and avoids contention on table cache
Test Plan: db_test
Reviewers: haobo, sdong, igor, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16089
Summary:
Use super_version insider NewIterator to avoid Ref() each component
separately under mutex
The new added bench shows NewIterator QPS increases from 515K to 719K
No meaningful improvement for multiget I guess due to its relatively small
cost comparing to 90 keys fetch in the test.
Test Plan: unit test and db_bench
Reviewers: igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D15609
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15489