Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.
2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.
3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.
4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.
5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc
Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.
Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh
Reviewers: igor, yhchiang, ljin, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20979
Summary:
As a preparation to support updating some options dynamically, I'd like
to first introduce ImmutableOptions, which is a subset of Options that
cannot be changed during the course of a DB lifetime without restart.
ColumnFamily will keep both Options and ImmutableOptions. Any component
below ColumnFamily should only take ImmutableOptions in their
constructor. Other options should be taken from APIs, which will be
allowed to adjust dynamically.
I am yet to make changes to memtable and other related classes to take
ImmutableOptions in their ctor. That can be done in a seprate diff as
this one is already pretty big.
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D22545
Summary: as title
Test Plan: none
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22737
Summary: It only covers Open() with default column family right now
Test Plan: make release
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22467
Summary:
Before this diff, whenever we Write to non-existing column family, Write() would fail.
This diff adds an option to not fail a Write() when WriteBatch points to non-existing column family. MongoDB said this would be useful for them, since they might have a transaction updating an index that was dropped by another thread. This way, they don't have to worry about checking if all indexes are alive on every write. They don't care if they lose writes to dropped index.
Test Plan: added a small unit test
Reviewers: sdong, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22143
Summary: 1. db/db_impl.cc:2324 (DBImpl::BackgroundCompaction) should not raise bg_error_ when column family is dropped during compaction.
Test Plan: 1. db_stress
Reviewers: ljin, yhchiang, dhruba, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22653
Summary: This implements a cache friendly version of Cuckoo Hash in which, in case of collission, we try to insert in next few locations. The size of the neighborhood to check is taken as an input parameter in builder and stored in the table.
Test Plan:
make check all
cuckoo_table_{db,reader,builder}_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22455
Summary: No __thread for ios.
Test Plan: compile works for ios now
Reviewers: ljin, dhruba
Reviewed By: dhruba
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22491
Summary: also fix HISTORY.md
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22437
Summary: Add a virtual function in table factory that will print table options
Test Plan: make release
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22149
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21915
Summary:
Currently, PlainTable must use mmap_reads. When PlainTable is used but
allow_mmap_reads is not set, rocksdb will fail in flush.
This diff improve Options sanitization and add MmapReadRequired() to
TableFactory.
Test Plan:
export ROCKSDB_TESTS=PlainTableOptionsSanitizeTest
make db_test -j32
./db_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: you, leveldb
Differential Revision: https://reviews.facebook.net/D21939
Summary:
Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write.
WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset.
Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries.
I will add more unit tests if people are OK about this API.
Test Plan:
make all check
Add unit tests.
Reviewers: yhchiang, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb, xjin
Differential Revision: https://reviews.facebook.net/D21381
Summary: Bump up version after we've cut 3.4
Test Plan: N/A
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba, igor
Differential Revision: https://reviews.facebook.net/D22047
Summary: This is a linux-specific system call.
Test Plan: ran db_bench
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: haobo, leveldb
Differential Revision: https://reviews.facebook.net/D21183
Summary: as title
Test Plan: make release
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21717
Summary: as title
Test Plan: make all check
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21201
Summary:
Contains the following changes:
- Implementation of cuckoo_table_factory
- Adding cuckoo table into AdaptiveTableFactory
- Adding cuckoo_table_db_test, similar to lines of plain_table_db_test
- Minor fixes to Reader: When a key is found in the table, return the key found instead of the search key.
- Minor fixes to Builder: Add table properties that are required by Version::UpdateTemporaryStats() during Get operation. Don't define curr_node as a reference variable as the memory locations may get reassigned during tree.push_back operation, leading to invalid memory access.
Test Plan:
cuckoo_table_reader_test --enable_perf
cuckoo_table_builder_test
cuckoo_table_db_test
make check all
make valgrind_check
make asan_check
Reviewers: sdong, igor, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21219
Summary: as title
Test Plan: make release
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21549
Summary: This will help debugging
Test Plan: ran, observed output
Reviewers: yinwang
Reviewed By: yinwang
Differential Revision: https://reviews.facebook.net/D20937
Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string
Test Plan: Add several unit test checks
Reviewers: yhchiang, igor, dhruba, haobo, ljin
Reviewed By: ljin
Subscribers: yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D20637
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.
Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20145
Summary:
Contains:
- Implementation of TableReader based on Cuckoo Hashing
- Unittests for CuckooTableReader
- Performance test for TableReader
Test Plan:
make cuckoo_table_reader_test
./cuckoo_table_reader_test
make valgrind_check
make asan_check
Reviewers: yhchiang, sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20511
Summary:
User gets undefinied error since the definition is not exposed.
Also re-enable the db test with only upper bound check
Test Plan: db_test, rate_limit_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20403
Summary: I changed SpatialDB API so that we only specify list of indexes when we create the database. That way, whoever is querying the DB doesn't need to know the full list of indexes and their options.
Test Plan: spatial_db_test
Reviewers: yinwang
Reviewed By: yinwang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20571
Summary:
Contains the following changes in CuckooTableBuilder:
- Take an extra parameter in constructor to identify last level file.
- Implement a better way to identify if a bucket has been inserted into the tree already during BFS search.
- Minor typos
Test Plan:
make cuckoo_table_builder
./cuckoo_table_builder
make valgrind_check
Reviewers: sdong, igor, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20445
Summary:
This diff is adding spatial index support to RocksDB.
When creating the DB user specifies a list of spatial indexes. Spatial indexes can cover different areas and have different resolution (i.e. number of tiles). This is useful for supporting different zoom levels.
Each element inserted into SpatialDB has:
* a bounding box, which determines how will the element be indexed
* string blob, which will usually be WKB representation of the polygon (http://en.wikipedia.org/wiki/Well-known_text)
* feature set, which is a map of key-value pairs, where value can be int, double, bool, null or a string. FeatureSet will be a set of tags associated with geo elements (for example, 'road': 'highway' and similar)
* a list of indexes to insert the element in. For example, small river element will be inserted in index for high zoom level, while country border will be inserted in all indexes (including the index for low zoom level).
Each query is executed on single spatial index. Query guarantees that it will return all elements intersecting the specified bounding box, but it might also return some extra non-intersecting elements.
Test Plan: Added bunch of unit tests in spatial_db_test
Reviewers: dhruba, yinwang
Reviewed By: yinwang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20361
Summary:
All public headers need to be under `include/rocksdb` directory. Otherwise, clients include our header files like this:
#include <rocksdb/db.h>
#include <utilities/backupable_db.h> // still our public header!
Also, internally, we include:
#include "utilities/backupable/backupable_db.h" // internal header
#include "utilities/backupable_db.h" // public header
which is confusing.
This way, when we install rocksdb as a system library, we can just copy `include/rocksdb` directory to system's header files. We can't really copy `utilities` directory to system's header files.
Test Plan: compiles
Reviewers: dhruba, ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20409
Summary:
Add a function GetOptions(), where based on four parameters users give: read/write amplification threshold, memory budget for mem tables and target DB size, it picks up a compaction style and parameters for them. Background threads are not touched yet.
One limit of this algorithm: since compression rate and key/value size are hard to predict, it's hard to predict level 0 file size from write buffer size. Simply make 1:1 ratio here.
Sample results: https://reviews.facebook.net/P477
Test Plan: Will add some a unit test where some sample scenarios are given and see they pick the results that make sense
Reviewers: yhchiang, dhruba, haobo, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18741
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.
Test Plan: add a unit test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D20085
Summary:
Suggested by @dhruba from the other diff, here is the improved
comments for parameters of the function
Test Plan: none
Reviewers: dhruba, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19623
Summary:
Adding option to save PlainTable index and bloom filter in SST file.
If there is no bloom block and/or index block, PlainTableReader builds
new ones. Otherwise PlainTableReader just use these blocks.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19527
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20175
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary: Browsing through the code, looks like StatsLogger is not used at all!
Test Plan: compiles
Reviewers: ljin, sdong, yhchiang, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19827
Summary: As title
Test Plan: Not needed for comment only.
Reviewers: yhchiang, ljin, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19887
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.
Test Plan: Add a verification in db_test
Reviewers: yhchiang, igor, ljin
Reviewed By: ljin
Subscribers: xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19551
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19587
Summary:
This is a rough sketch of our new document API. Would like to get some thoughts and comments about the high-level architecture and API.
I didn't optimize for performance at all. Leaving some low-hanging fruit so that we can be happy when we fix them! :)
Currently, bunch of features are not supported at all. Indexes can be only specified when creating database. There is no query planner whatsoever. This will all be added in due time.
Test Plan: Added a simple unit test
Reviewers: haobo, yhchiang, dhruba, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18747
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
Test Plan: db_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19425
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.
Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19359
Summary: stackable_db doesn't have GetLiveFilesMetaData() implemented. Add it
Test Plan: make all check
Reviewers: yhchiang, ljin, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19491
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way. Instead of measuring the size of a file, it
measure I/O cost in per read / write basis.
Test Plan: make all check
Reviewers: sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19383
Summary:
This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
waiting for longer than the specified time. If an abortion happens,
associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
will be limited by timeout_hint_us.
The default value of timeout_hint_us is 0 (i.e., OFF.)
The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test
Reviewers: igor, ljin, haobo, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18837
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.
Test Plan: make all check
Reviewers: igor, haobo, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18921
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.
Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.
The new data structure is documented in comments in the codes.
Test Plan:
make all check
set threshold_use_skiplist in several tests
Reviewers: yhchiang, haobo, ljin
Reviewed By: yhchiang, ljin
Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19299
Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes.
Test Plan: make all check
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, dhruba, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D19281
Summary: Currently ldb tool dump keys either in ascii format or hex format - neither is ideal if the key has a binary structure and is not readable in ascii. This diff also allows LDB tool to be customized in ways beyond DB options.
Test Plan: verify that key formatter works with some simple db with binary key.
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19209
Summary:
After evaluating options for JSON storage, I decided to implement our own. The reason is that we'll be able to optimize it better and we get to reduce unnecessary dependencies (which is what we'd get with folly).
I also plan to write a serializer/deserializer for JSONDocument with our own binary format similar to BSON. That way we'll store binary JSON format in RocksDB instead of the plain-text JSON. This means less storage and faster deserialization.
There are still some inefficiencies left here. I plan to optimize them after we develop a functioning DocumentDB. That way we can move and iterate faster.
Test Plan: added a unit test
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: haobo
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18831
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h
Test Plan: Add unit test coverage in plain_table_db_test
Reviewers: yhchiang, igor, dhruba, ljin, haobo
Reviewed By: haobo
Subscribers: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D18735
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.
Test Plan: unit test and shadow testing
Reviewers: dhruba, kailiu, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19047
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory
I think it makes more sense to have output as the first parameter.
Also, fixed a NewAdaptiveTableFactory() call in unit test
Test Plan: unit test
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19119
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.
Test Plan: Turn it on in plain_table_db_test and see the logs.
Reviewers: haobo, ljin
Reviewed By: ljin
Subscribers: leveldb, nkg-, dhruba, yhchiang
Differential Revision: https://reviews.facebook.net/D19095
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.
Test Plan: add a unit test
Reviewers: haobo, igor
Reviewed By: igor
Subscribers: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19017
Summary: Include max_write_buffer_number >= 2 to SanitizeOptions.
Test Plan: make all check
Reviewers: haobo, sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19077
Summary: This is a temporary solution to a issue that we have with compression libraries. See task #4453446.
Test Plan: make check doesn't complain :)
Reviewers: haobo, ljin, yhchiang, dhruba, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18975
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490
Test Plan: added unit test. also, stress test for some time
Reviewers: sdong, haobo, dhruba, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D18951
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
Summary:
This patch changes meaning of options.bloom_locality: 0 means disable cache line optimization and any positive number means use CACHE_LINE_SIZE as block size (the previous behavior is the block size will be CACHE_LINE_SIZE*options.bloom_locality). By doing it, the divide operations inside a block can be replaced by a shift.
Performance is improved:
https://reviews.facebook.net/P471
Also, improve the basic algorithm in two ways:
(1) make sure num of blocks is an odd number
(2) rotate bytes after every probe in locality mode. Since the divider is 2^n, unless doing it, we are never able to use all the bits.
Improvements of false positive: https://reviews.facebook.net/P459
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: dhruba, yhchiang, igor, leveldb
Differential Revision: https://reviews.facebook.net/D18843
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
Summary:
This diff is addressing multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md
Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/
I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to optimize rocksdb for less stalls with level style compaction. This is very likely not ideal configuration. Feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.
Test Plan: compiled rocksdb. ran examples.
Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18621
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.
Test Plan: make all check
Reviewers: haobo, igor, yhchiang
Reviewed By: yhchiang
CC: leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D18471
Summary: added a new option to BackupEngine: if share_files_with_checksum is set to true, sst files are stored in shared_checksum/ and are identified by the triple (file name, checksum, file size) instead of just the file name. This option is targeted at distributed databases that want to backup their primary replica.
Test Plan: unit tests and tested backup and restore on a distributed rocksdb
Reviewers: igor
Reviewed By: igor
Differential Revision: https://reviews.facebook.net/D18393
Summary: Added a method that executes a callback on every cache entry.
Test Plan: added a unit test
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D18441