Commit Graph

862 Commits

Author SHA1 Message Date
Igor Canadi
21ddcf6e4f Remove allow_thread_local
Summary: See https://reviews.facebook.net/D19365

Test Plan: compiles

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23907
2014-09-24 13:12:16 -07:00
Lei Jin
0a29ce5393 re-enable BlockBasedTable::SetupForCompaction()
Summary:
It was commented out in D22545 by accident. Keep the option in
ImmutableOptions for now. I can make it dynamic in
https://reviews.facebook.net/D23349

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23865
2014-09-23 14:18:57 -07:00
sdong
d0de413f4d WriteBatchWithIndex to allow different Comparators for different column families
Summary:
Previously, one single column family is given to WriteBatchWithIndex to index keys for all column families. An extra map from column family ID to comparator is maintained which can override the default comparator given in the constructor. A WriteBatchWithIndex::SetComparatorForCF() is added for user to add comparators per column family.

Also move more codes into anonymous namespace.

Test Plan: Add a unit test

Reviewers: ljin, igor

Reviewed By: igor

Subscribers: dhruba, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D23355
2014-09-22 13:47:39 -07:00
Lei Jin
57a32f147f change target_file_size_base to uint64_t
Summary: It contrains the file size to be 4G max with int

Test Plan:
tried to grep instance and made sure other related variables are also
uint64

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23697
2014-09-22 11:15:03 -07:00
Lei Jin
51af7c326c CuckooTable: add one option to allow identity function for the first hash function
Summary:
MurmurHash becomes expensive when we do millions Get() a second in one
thread. Add this option to allow the first hash function to use identity
function as hash function. It results in QPS increase from 3.7M/s to
~4.3M/s. I did not observe improvement for end to end RocksDB
performance. This may be caused by other bottlenecks that I will address
in a separate diff.

Test Plan:
```
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=0
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.272us (3.7 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.138us (7.2 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.1 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.0 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.144us (6.9 Mqps) with batch size of 100, # of found keys 125829120

With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.201us (5.0 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.123us (8.1 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.112us (8.9 Mqps) with batch size of 100, # of found keys 104857600

With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.251us (4.0 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.107us (9.4 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.099us (10.1 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.100us (10.0 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.116us (8.6 Mqps) with batch size of 100, # of found keys 83886080

With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.189us (5.3 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.095us (10.5 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.096us (10.4 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.098us (10.2 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.105us (9.5 Mqps) with batch size of 100, # of found keys 73400320

[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=1
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.230us (4.3 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.086us (11.7 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.088us (11.3 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 100, # of found keys 125829120

With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.159us (6.3 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.6 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.5 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.082us (12.2 Mqps) with batch size of 100, # of found keys 104857600

With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.154us (6.5 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (13.0 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (12.9 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.079us (12.6 Mqps) with batch size of 100, # of found keys 83886080

With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.218us (4.6 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.083us (12.0 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.085us (11.7 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.086us (11.6 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 100, # of found keys 73400320
```

Reviewers: sdong, igor, yhchiang

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23451
2014-09-18 11:00:48 -07:00
Lei Jin
a062e1f2c4 SetOptions() for memtable related options
Summary: as title

Test Plan:
make all check
I will think a way to set up stress test for this

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23055
2014-09-17 12:49:13 -07:00
Lei Jin
e4eca6a1e5 Options conversion function for convenience
Summary: as title

Test Plan: options_test

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23283
2014-09-17 12:46:32 -07:00
Igor Canadi
4a27a2f193 Don't sync manifest when disableDataSync = true
Summary: As we discussed offline

Test Plan: compiles

Reviewers: yhchiang, sdong, ljin, dhruba

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22989
2014-09-15 11:32:01 -07:00
Jonah Cohen
092f97e219 Fix comments and typos
Summary: Correct some comments and typos in RocksDB.

Test Plan: Inspection

Reviewers: sdong, igor

Reviewed By: igor

Differential Revision: https://reviews.facebook.net/D23133
2014-09-09 15:20:49 -07:00
Xiaozheng Tie
6cc12860f0 Added a few statistics for BackupableDB
Summary:
Added the following statistics to BackupableDB:

1. Number of successful and failed backups in class BackupStatistics
2. Time taken to do a backup
3. Number of files in a backup

1 is implemented in the BackupStatistics class
2 and 3 are added in the BackupMeta and BackupInfo class

Test Plan:
1 can be tested using BackupStatistics::ToString(),
2 and 3 can be tested in the BackupInfo class

Reviewers: sdong, igor2, ljin, igor

Reviewed By: igor

Differential Revision: https://reviews.facebook.net/D22785
2014-09-09 13:44:42 -07:00
Lei Jin
52311463e9 MemTableOptions
Summary: removed reference to options in WriteBatch and DBImpl::Get()

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23049
2014-09-08 18:46:52 -07:00
Lei Jin
659d2d50c3 move compaction_filter to immutable_options
Summary:
all shared_ptrs are in immutable_options now. This will also make
options assignment a little cheaper

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23001
2014-09-08 15:09:25 -07:00
Lei Jin
048560a642 reduce references to cfd->options() in DBImpl
Summary:
I found it is almost impossible to get rid of this function in a single
batch. I will take a step by step approach

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22995
2014-09-08 15:04:34 -07:00
Igor Canadi
a2bb7c3c33 Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes

The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).

When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.

This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.

Test Plan: make check for now. I'll add some unit tests later. Also, perf test.

Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
Feng Zhu
0af157f9bf Implement full filter for block based table.
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.

2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.

3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.

4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.

5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc

Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.

Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh

Reviewers: igor, yhchiang, ljin, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20979
2014-09-08 10:37:05 -07:00
Nik Bougalis
d1cfb71ec7 Remove unused member(s) 2014-09-05 20:50:29 -07:00
liuhuahang
bb6ae0f80c fix more compile warnings
N/A

Change-Id: I5b6f9c70aea7d3f3489328834fed323d41106d9f
Signed-off-by: liuhuahang <liuhuahang@zerus.co>
2014-09-05 14:14:37 +08:00
Lei Jin
5665e5e285 introduce ImmutableOptions
Summary:
As a preparation to support updating some options dynamically, I'd like
to first introduce ImmutableOptions, which is a subset of Options that
cannot be changed during the course of a DB lifetime without restart.

ColumnFamily will keep both Options and ImmutableOptions. Any component
below ColumnFamily should only take ImmutableOptions in their
constructor. Other options should be taken from APIs, which will be
allowed to adjust dynamically.

I am yet to make changes to memtable and other related classes to take
ImmutableOptions in their ctor. That can be done in a seprate diff as
this one is already pretty big.

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D22545
2014-09-04 16:18:36 -07:00
Raghav Pisolkar
e0b99d4f5d created a new ReadOptions parameter 'iterate_upper_bound' 2014-09-04 11:00:16 -07:00
Lei Jin
703c3eacd9 comments about the BlockBasedTableOptions migration in Options
Summary: as title

Test Plan: none

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22737
2014-09-03 17:01:34 -07:00
Lei Jin
9b58c73c7c call SanitizeDBOptionsByCFOptions() in the right place
Summary: It only covers Open() with default column family right now

Test Plan: make release

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22467
2014-09-02 14:42:23 -07:00
Igor Canadi
a84234a61b Ignore missing column families
Summary:
Before this diff, whenever we Write to non-existing column family, Write() would fail.

This diff adds an option to not fail a Write() when WriteBatch points to non-existing column family. MongoDB said this would be useful for them, since they might have a transaction updating an index that was dropped by another thread. This way, they don't have to worry about checking if all indexes are alive on every write. They don't care if they lose writes to dropped index.

Test Plan: added a small unit test

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22143
2014-09-02 13:29:05 -07:00
Feng Zhu
8438a19360 fix dropping column family bug
Summary: 1. db/db_impl.cc:2324 (DBImpl::BackgroundCompaction) should not raise bg_error_ when column family is dropped during compaction.

Test Plan: 1. db_stress

Reviewers: ljin, yhchiang, dhruba, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22653
2014-09-02 12:25:58 -07:00
Radheshyam Balasundaram
7f71448388 Implementing a cache friendly version of Cuckoo Hash
Summary: This implements a cache friendly version of Cuckoo Hash in which, in case of collission, we try to insert in next few locations. The size of the neighborhood to check is taken as an input parameter in builder and stored in the table.

Test Plan:
make check all
cuckoo_table_{db,reader,builder}_test

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22455
2014-08-28 10:42:23 -07:00
Igor Canadi
d5bd6c772b Fix ios compile
Summary: No __thread for ios.

Test Plan: compile works for ios now

Reviewers: ljin, dhruba

Reviewed By: dhruba

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22491
2014-08-28 12:46:05 -04:00
Lei Jin
1755581f19 improve OptimizeForPointLookup()
Summary: also fix HISTORY.md

Test Plan: make all check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22437
2014-08-26 14:15:00 -07:00
Lei Jin
23861857c4 ReadOptions.total_order_seek to allow total order seek for block-based table when hash index is enabled
Summary: as title

Test Plan: table_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22239
2014-08-25 16:14:30 -07:00
Lei Jin
a98badff16 print table options
Summary: Add a virtual function in table factory that will print table options

Test Plan: make release

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22149
2014-08-25 14:24:09 -07:00
Lei Jin
384400128f move block based table related options BlockBasedTableOptions
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21915
2014-08-25 14:22:05 -07:00
Andrew Bonventre
050869177e Add missing include to use std::unique_ptr
This was causing issues when including this header from another file.
2014-08-23 13:02:21 -04:00
Yueh-Hsuan Chiang
63a2215c63 Improve Options sanitization and add MmapReadRequired() to TableFactory
Summary:
Currently, PlainTable must use mmap_reads.  When PlainTable is used but
allow_mmap_reads is not set, rocksdb will fail in flush.

This diff improve Options sanitization and add MmapReadRequired() to
TableFactory.

Test Plan:
export ROCKSDB_TESTS=PlainTableOptionsSanitizeTest
make db_test -j32
./db_test

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: you, leveldb

Differential Revision: https://reviews.facebook.net/D21939
2014-08-20 15:53:39 -07:00
sdong
28b5c76004 WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index
Summary:
Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write.

WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset.

Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries.

I will add more unit tests if people are OK about this API.

Test Plan:
make all check
Add unit tests.

Reviewers: yhchiang, igor, MarkCallaghan, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb, xjin

Differential Revision: https://reviews.facebook.net/D21381
2014-08-18 16:37:38 -07:00
sdong
68eed8c9f0 Bump up version
Summary: Bump up version after we've cut 3.4

Test Plan: N/A

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D22047
2014-08-18 12:02:02 -07:00
Lei Jin
58c49466d2 Allow env_posix to lower background thread IO priority
Summary: This is a linux-specific system call.

Test Plan: ran db_bench

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: haobo, leveldb

Differential Revision: https://reviews.facebook.net/D21183
2014-08-13 20:49:58 -07:00
Lei Jin
5a5953b388 Add histogram for DB_SEEK
Summary: as title

Test Plan: make release

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21717
2014-08-13 15:56:37 -07:00
Lei Jin
5d0074c471 set bytes_per_sync to 1MB if rate limiter is enabled
Summary: as title

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21201
2014-08-12 16:42:18 -07:00
Radheshyam Balasundaram
9674c11d01 Integrating Cuckoo Hash SST Table format into RocksDB
Summary:
Contains the following changes:
- Implementation of cuckoo_table_factory
- Adding cuckoo table into AdaptiveTableFactory
- Adding cuckoo_table_db_test, similar to lines of plain_table_db_test
- Minor fixes to Reader: When a key is found in the table, return the key found instead of the search key.
- Minor fixes to Builder: Add table properties that are required by Version::UpdateTemporaryStats() during Get operation. Don't define curr_node as a reference variable as the memory locations may get reassigned during tree.push_back operation, leading to invalid memory access.

Test Plan:
cuckoo_table_reader_test --enable_perf
cuckoo_table_builder_test
cuckoo_table_db_test
make check all
make valgrind_check
make asan_check

Reviewers: sdong, igor, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21219
2014-08-11 20:21:07 -07:00
Lei Jin
37c6740c38 make statistics ToString function empty instead of pure virtual
Summary: as title

Test Plan: make release

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21549
2014-08-11 15:04:41 -07:00
Spencer Kimball
38e8b727a8 Fix typo, add missing inclusion of state void* in invocation of
create_compaction_filter_v2_.
2014-08-06 18:42:15 -04:00
Spencer Kimball
c1f588af71 Add support for C bindings to the compaction V2 filter mechanism.
Test Plan: make c_test && ./c_test

Some fixes after merge.
2014-08-06 15:55:48 -04:00
Igor Canadi
9c5a3f4746 FeatureSet DebugString
Summary: This will help debugging

Test Plan: ran, observed output

Reviewers: yinwang

Reviewed By: yinwang

Differential Revision: https://reviews.facebook.net/D20937
2014-08-01 08:55:37 -07:00
Demon
7ef7df005f fix project name in the comments
fix project name in the comments
2014-07-31 11:42:36 +08:00
Igor Canadi
99e03bcbf1 Better comment for inplace_update_support
Summary: See https://github.com/facebook/rocksdb/issues/215

Test Plan: none

Reviewers: dhruba, sdong, ljin, yhchiang, nkg-

Reviewed By: nkg-

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20769
2014-07-30 09:32:47 -07:00
sdong
f04356e660 Add DB::GetIntProperty() to return integer properties to be returned as integers
Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string

Test Plan: Add several unit test checks

Reviewers: yhchiang, igor, dhruba, haobo, ljin

Reviewed By: ljin

Subscribers: yoshinorim, leveldb

Differential Revision: https://reviews.facebook.net/D20637
2014-07-28 16:55:57 -07:00
Lei Jin
40fa8a4cd5 make statistics forward-able
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.

Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20145
2014-07-28 12:05:36 -07:00
Radheshyam Balasundaram
62f9b071ff Implementation of CuckooTableReader
Summary:
Contains:
- Implementation of TableReader based on Cuckoo Hashing
- Unittests for CuckooTableReader
- Performance test for TableReader

Test Plan:
make cuckoo_table_reader_test
./cuckoo_table_reader_test
make valgrind_check
make asan_check

Reviewers: yhchiang, sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20511
2014-07-25 16:37:32 -07:00
Lei Jin
d650612c4c expose RateLimiter definition
Summary:
User gets undefinied error since the definition is not exposed.
Also re-enable the db test with only upper bound check

Test Plan: db_test, rate_limit_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20403
2014-07-25 15:17:06 -07:00
Igor Canadi
0754d4cb3b SpatialDB change API
Summary: I changed SpatialDB API so that we only specify list of indexes when we create the database. That way, whoever is querying the DB doesn't need to know the full list of indexes and their options.

Test Plan: spatial_db_test

Reviewers: yinwang

Reviewed By: yinwang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20571
2014-07-24 16:39:33 -04:00
Radheshyam Balasundaram
07a7d870b8 Addressing TODOs in CuckooTableBuilder
Summary:
Contains the following changes in CuckooTableBuilder:
- Take an extra parameter in constructor to identify last level file.
- Implement a better way to identify if a bucket has been inserted into the tree already during BFS search.
- Minor typos

Test Plan:
make cuckoo_table_builder
./cuckoo_table_builder
make valgrind_check

Reviewers: sdong, igor, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20445
2014-07-24 10:07:41 -07:00
Igor Canadi
6296330417 SpatialDB
Summary:
This diff is adding spatial index support to RocksDB.

When creating the DB user specifies a list of spatial indexes. Spatial indexes can cover different areas and have different resolution (i.e. number of tiles). This is useful for supporting different zoom levels.

Each element inserted into SpatialDB has:
* a bounding box, which determines how will the element be indexed
* string blob, which will usually be WKB representation of the polygon (http://en.wikipedia.org/wiki/Well-known_text)
* feature set, which is a map of key-value pairs, where value can be int, double, bool, null or a string. FeatureSet will be a set of tags associated with geo elements (for example, 'road': 'highway' and similar)
* a list of indexes to insert the element in. For example, small river element will be inserted in index for high zoom level, while country border will be inserted in all indexes (including the index for low zoom level).

Each query is executed on single spatial index. Query guarantees that it will return all elements intersecting the specified bounding box, but it might also return some extra non-intersecting elements.

Test Plan: Added bunch of unit tests in spatial_db_test

Reviewers: dhruba, yinwang

Reviewed By: yinwang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20361
2014-07-23 14:22:58 -04:00
Igor Canadi
1053358a84 Bump the version 2014-07-23 10:21:54 -04:00
Igor Canadi
0ff183a0d9 Move include/utilities/*.h to include/rocksdb/utilities/*.h
Summary:
All public headers need to be under `include/rocksdb` directory. Otherwise, clients include our header files like this:

    #include <rocksdb/db.h>
    #include <utilities/backupable_db.h> // still our public header!

Also, internally, we include:

    #include "utilities/backupable/backupable_db.h" // internal header
    #include "utilities/backupable_db.h" // public header

which is confusing.

This way, when we install rocksdb as a system library, we can just copy `include/rocksdb` directory to system's header files. We can't really copy `utilities` directory to system's header files.

Test Plan: compiles

Reviewers: dhruba, ljin, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20409
2014-07-23 10:21:38 -04:00
sdong
e6de02103a Add a utility function to guess optimized options based on constraints
Summary:
Add a function GetOptions(), where based on four parameters users give: read/write amplification threshold, memory budget for mem tables and target DB size, it picks up a compaction style and parameters for them. Background threads are not touched yet.

One limit of this algorithm: since compression rate and key/value size are hard to predict, it's hard to predict level 0 file size from write buffer size. Simply make 1:1 ratio here.

Sample results: https://reviews.facebook.net/P477

Test Plan: Will add some a unit test where some sample scenarios are given and see they pick the results that make sense

Reviewers: yhchiang, dhruba, haobo, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18741
2014-07-22 15:24:21 -07:00
sdong
f6b7e1ed1a Allow user to specify DB path of output file of manual compaction
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.

Test Plan: add a unit test

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D20085
2014-07-21 19:06:00 -07:00
Radheshyam Balasundaram
cf3da899b0 Adding a new SST table builder based on Cuckoo Hashing
Summary:
Cuckoo Hashing based SST table builder. Contains:
- Cuckoo Hashing logic and file storage logic.
- Unit tests for logic

Test Plan:
make cuckoo_table_builder_test
./cuckoo_table_builder_test
make check all

Reviewers: yhchiang, igor, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19545
2014-07-21 13:26:09 -07:00
Lei Jin
9c0d84d240 improve comments for CrateRateLimiter()
Summary:
Suggested by @dhruba from the other diff, here is the improved
comments for parameters of the function

Test Plan: none

Reviewers: dhruba, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19623
2014-07-21 11:03:16 -07:00
Stanislau Hlebik
9d70cce047 Adding option to save PlainTable index and bloom filter in SST file.
Summary:
Adding option to save PlainTable index and bloom filter in SST file.
If there is no bloom block and/or index block, PlainTableReader builds
new ones. Otherwise PlainTableReader just use these blocks.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19527
2014-07-18 16:58:13 -07:00
Stanislau Hlebik
92d73cbe78 Add PlainTableOptions
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20175
2014-07-18 00:08:38 -07:00
sdong
0abaed2e08 Support multiple DB directories in universal compaction style
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.

Test Plan: Add new unit tests

Reviewers: ljin, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D19869
2014-07-15 12:06:28 -07:00
Igor Canadi
20c056306b Remove stats logger
Summary: Browsing through the code, looks like StatsLogger is not used at all!

Test Plan: compiles

Reviewers: ljin, sdong, yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19827
2014-07-15 09:16:32 -04:00
sdong
dd6c444822 Improve Put()'s comment to indicate that the key is overwritten if existing
Summary: As title

Test Plan: Not needed for comment only.

Reviewers: yhchiang, ljin, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19887
2014-07-14 14:43:59 -07:00
Reed Allman
1fc71a4b16 C API: create missing cf's, cleanup 2014-07-10 12:55:53 -07:00
sdong
01700b6911 Update master to version 3.3
Summary: As tittle

Test Plan: no need

Reviewers: igor, yhchiang, ljin

Reviewed By: ljin

Subscribers: haobo, dhruba, xjin, leveldb

Differential Revision: https://reviews.facebook.net/D19629
2014-07-10 11:59:35 -07:00
sdong
36de0e5359 Add a function to return current perf level
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.

Test Plan: Add a verification in db_test

Reviewers: yhchiang, igor, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19551
2014-07-10 11:35:48 -07:00
Stanislau Hlebik
30c81e7717 Removing NewTotalOrderPlainTableFactory
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19587
2014-07-10 11:32:04 -07:00
Igor Canadi
f0a8be253e JSON (Document) API sketch
Summary:
This is a rough sketch of our new document API. Would like to get some thoughts and comments about the high-level architecture and API.

I didn't optimize for performance at all. Leaving some low-hanging fruit so that we can be happy when we fix them! :)

Currently, bunch of features are not supported at all. Indexes can be only specified when creating database. There is no query planner whatsoever. This will all be added in due time.

Test Plan: Added a simple unit test

Reviewers: haobo, yhchiang, dhruba, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18747
2014-07-10 09:31:42 -07:00
Lei Jin
534357ca3a integrate rate limiter into rocksdb
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.

Test Plan: db_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19425
2014-07-08 12:31:49 -07:00
Lei Jin
5ef1ba7ff5 generic rate limiter
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.

Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19359
2014-07-08 11:41:57 -07:00
Reed Allman
fd3fb4b0bf C API: update options w/ convenience funcs & fifo compaction 2014-07-08 10:57:45 -07:00
sdong
01159aa802 stackable_db: add default function for GetLiveFilesMetaData()
Summary: stackable_db doesn't have GetLiveFilesMetaData() implemented. Add it

Test Plan: make all check

Reviewers: yhchiang, ljin, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19491
2014-07-07 17:02:57 -07:00
Igor Canadi
8a03935f8c Fix valgrind error in c_test
Summary:
External contribution caused some valgrind errors: 1a34aaaef0

This diff fixes them

Test Plan: ran valgrind

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19485
2014-07-07 14:41:54 -07:00
Evan Shaw
3f7104d7c5 C API: Allow setting compaction filter factory 2014-07-08 08:12:36 +12:00
Evan Shaw
91bede79cc C API: Add support for compaction filter factories (v1) 2014-07-08 08:12:36 +12:00
Evan Shaw
0bf5589c74 Fix compaction_filter.h typos 2014-07-08 08:11:01 +12:00
Igor Canadi
0bc5fa9f40 Merge pull request #190 from edsrzf/c-api-writebatch-serialized
C API: support constructing write batch from serialized representation
2014-07-07 10:17:43 -07:00
Reed Allman
1a34aaaef0 C API: column family support 2014-07-07 01:41:01 -07:00
Evan Shaw
9fc23d0c56 C API: support constructing write batch from serialized representation 2014-07-06 10:36:33 +12:00
Yueh-Hsuan Chiang
90a6aca48e Finer report I/O stats about Flush and Compaction.
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way.  Instead of measuring the size of a file, it
measure I/O cost in per read / write basis.

Test Plan: make all check

Reviewers: sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19383
2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang
d4d338de33 Add timeout_hint_us to WriteOptions and introduce Status::TimeOut.
Summary:
This diff adds timeout_hint_us to WriteOptions.  If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
  waiting for longer than the specified time.  If an abortion happens,
  associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
  will be limited by timeout_hint_us.

The default value of timeout_hint_us is 0 (i.e., OFF.)

The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.

Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test

Reviewers: igor, ljin, haobo, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18837
2014-07-03 15:47:02 -07:00
sdong
2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
sdong
9c332aa11a HashLinkList memtable switches a bucket to a skip list to reduce performance outliers
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.

Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.

The new data structure is documented in comments in the codes.

Test Plan:
make all check
set threshold_use_skiplist in several tests

Reviewers: yhchiang, haobo, ljin

Reviewed By: yhchiang, ljin

Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19299
2014-07-01 17:14:15 -07:00
sdong
19de6a7aad Remove MemTableRep::GetIterator(const Slice& slice)
Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes.

Test Plan: make all check

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D19281
2014-06-25 14:09:29 -07:00
Haobo Xu
dfb31d152d [RocksDB] allow LDB tool to have customized key formatter
Summary: Currently ldb tool dump keys either in ascii format or hex format - neither is ideal if the key has a binary structure and is not readable in ascii. This diff also allows LDB tool to be customized in ways beyond DB options.

Test Plan: verify that key formatter works with some simple db with binary key.

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19209
2014-06-23 15:35:40 -07:00
Igor Canadi
00b26c3a83 JSONDocument
Summary:
After evaluating options for JSON storage, I decided to implement our own. The reason is that we'll be able to optimize it better and we get to reduce unnecessary dependencies (which is what we'd get with folly).

I also plan to write a serializer/deserializer for JSONDocument with our own binary format similar to BSON. That way we'll store binary JSON format in RocksDB instead of the plain-text JSON. This means less storage and faster deserialization.

There are still some inefficiencies left here. I plan to optimize them after we develop a functioning DocumentDB. That way we can move and iterate faster.

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18831
2014-06-20 11:14:14 +02:00
Igor Canadi
d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Igor Canadi
bae495740d Merge pull request #179 from edsrzf/c-api-compaction-filter
Support for compaction filters in the C API
2014-06-19 21:22:46 +02:00
Evan Shaw
d72313a7fa Add a way to set compaction filter in the C API 2014-06-19 16:31:24 +12:00
Evan Shaw
df2701373d Support for compaction filters in the C API 2014-06-19 16:31:17 +12:00
sdong
edd47c5104 PlainTable to encode to avoid to rewrite prefix when it is the same as the previous key
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h

Test Plan: Add unit test coverage in plain_table_db_test

Reviewers: yhchiang, igor, dhruba, ljin, haobo

Reviewed By: haobo

Subscribers: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18735
2014-06-18 20:41:52 -07:00
Haobo Xu
0f0076ed5a [RocksDB] Reduce memory footprint of the blockbased table hash index.
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.

Test Plan: unit test and shadow testing

Reviewers: dhruba, kailiu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19047
2014-06-18 18:16:07 -07:00
Igor Canadi
3525aac9e5 Change order of parameters in adaptive table factory
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory

I think it makes more sense to have output as the first parameter.

Also, fixed a NewAdaptiveTableFactory() call in unit test

Test Plan: unit test

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19119
2014-06-18 07:04:37 +02:00
sdong
8c265c08f1 HashLinkList to log distribution of number of entries aross buckets
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.

Test Plan: Turn it on in plain_table_db_test and see the logs.

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: leveldb, nkg-, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D19095
2014-06-17 17:55:36 -07:00
sdong
200e4b4a72 Add a table factory that can read DB with both of PlainTable and BlockBasedTable in it
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.

Test Plan: add a unit test

Reviewers: haobo, igor

Reviewed By: igor

Subscribers: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19017
2014-06-17 11:49:22 -07:00
Yueh-Hsuan Chiang
e6e259b8ab Include max_write_buffer_number >= 2 to SanitizeOptions.
Summary: Include max_write_buffer_number >= 2 to SanitizeOptions.

Test Plan: make all check

Reviewers: haobo, sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19077
2014-06-16 16:26:46 -07:00
Igor Canadi
f43c8262c2 Don't compress block bigger than 2GB
Summary: This is a temporary solution to a issue that we have with compression libraries. See task #4453446.

Test Plan: make check doesn't complain :)

Reviewers: haobo, ljin, yhchiang, dhruba, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18975
2014-06-09 12:26:09 -07:00
Igor Canadi
a0191c9dfe Create Missing Column Families
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490

Test Plan: added unit test. also, stress test for some time

Reviewers: sdong, haobo, dhruba, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18951
2014-06-06 18:04:56 -07:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
sdong
462796697c dynamic_bloom: replace some divide (remainder) operations with shifts in locality mode, and other improvements
Summary:
This patch changes meaning of options.bloom_locality: 0 means disable cache line optimization and any positive number means use CACHE_LINE_SIZE as block size (the previous behavior is the block size will be CACHE_LINE_SIZE*options.bloom_locality). By doing it, the divide operations inside a block can be replaced by a shift.

Performance is improved:
https://reviews.facebook.net/P471

Also, improve the basic algorithm in two ways:
(1) make sure num of blocks is an odd number
(2) rotate bytes after every probe in locality mode. Since the divider is 2^n, unless doing it, we are never able to use all the bits.
Improvements of false positive: https://reviews.facebook.net/P459

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: dhruba, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D18843
2014-06-02 17:36:38 -07:00
Igor Canadi
f068d2a94d Move master version to 3.2 2014-05-23 10:27:56 -07:00
Igor Canadi
6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Igor Canadi
b2cf95fe38 Call EnableFileDeletions with false as argument 2014-05-20 14:28:51 -07:00
Igor Canadi
c07c9606ed Expose Status::code() 2014-05-14 16:23:40 -07:00
Igor Canadi
26f5dd9a5a TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.

Here's description of task #4296714:
       I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.

       For example, this table has 3155 entries and 1391828 deleted keys.

The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.

Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.

In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.

Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests

Reviewers: sdong, dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18579
2014-05-13 12:30:55 -07:00
Igor Canadi
3edc056f6d comment 2014-05-10 11:25:56 -07:00
Igor Canadi
038a477b53 Make it easier to start using RocksDB
Summary:
This diff is addressing multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md

Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/

I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to optimize rocksdb for less stalls with level style compaction. This is very likely not ideal configuration. Feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.

Test Plan: compiled rocksdb. ran examples.

Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18621
2014-05-10 10:49:33 -07:00
sdong
3a171dcb51 Pass logger to memtable rep and TLB page allocation error logged to info logs
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.

Test Plan: make all check

Reviewers: haobo, igor, yhchiang

Reviewed By: yhchiang

CC: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18471
2014-05-05 16:43:37 -07:00
sdong
4a7c747064 Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB""
And make the default 0 for hash linked list memtable

This reverts commit d69dc64be7.
2014-05-04 13:56:29 -07:00
Igor Canadi
db1854d78c Declare all DB methods virtual so that StackableDB can override them 2014-05-04 11:39:49 -07:00
Igor Canadi
d69dc64be7 Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"
This reverts commit 7dafa3a1d7.
2014-05-04 08:37:09 -07:00
Benjamin Renard
41e5cf2392 Add share_files_with_cheksum option to BackupEngine
Summary: added a new option to BackupEngine: if share_files_with_checksum is set to true, sst files are stored in shared_checksum/ and are identified by the triple (file name, checksum, file size) instead of just the file name. This option is targeted at distributed databases that want to backup their primary replica.

Test Plan: unit tests and tested backup and restore on a distributed rocksdb

Reviewers: igor

Reviewed By: igor

Differential Revision: https://reviews.facebook.net/D18393
2014-05-02 17:08:55 -07:00
Igor Canadi
4ecfbcf865 ApplyToAllCacheEntries
Summary: Added a method that executes a callback on every cache entry.

Test Plan: added a unit test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D18441
2014-05-02 16:24:04 -04:00
Igor Canadi
82042f451c Include version in options 2014-05-01 19:28:23 -04:00
Igor Canadi
0afc8bc29a xxHash
Summary:
Originally: https://github.com/facebook/rocksdb/pull/87/files

I'm taking over to apply some finishing touches

Test Plan: will add tests

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18315
2014-05-01 14:09:32 -04:00
Igor Canadi
096f5be0ed Put column family information in LiveFileMetaData
Summary: As summary

Test Plan: compiles :)

Reviewers: dhruba, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18405
2014-04-30 16:24:52 -04:00
Igor Canadi
df70047669 Flush stale column families
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.

By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers.

Test Plan: Added a test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18345
2014-04-30 14:33:40 -04:00
sdong
7dafa3a1d7 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357
2014-04-30 11:02:26 -07:00
Yueh-Hsuan Chiang
9d9d2965cb Add a new mem-table representation based on cuckoo hash.
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
  Cuckoo hash uses multiple hash functions.  This allows each key to have multiple
  possible locations in the mem-table.

  - Put: When insert a key, it will try to find whether one of its possible
    locations is vacant and store the key.  If none of its possible
    locations are available, then it will kick out a victim key and
    store at that location.  The kicked-out victim key will then be
    stored at a vacant space of its possible locations or kick-out
    another victim.  In this diff, the kick-out path (known as
    cuckoo-path) is found using BFS, which guarantees to be the shortest.

 - Get: Simply tries all possible locations of a key --- this guarantees
   worst-case constant time complexity.

 - Time complexity: O(1) for Get, and average O(1) for Put if the
   fullness of the mem-table is below 80%.

 - Default using two hash functions, the number of hash functions used
   by the cuckoo-hash may dynamically increase if it fails to find a
   short-enough kick-out path.

 - Currently, HashCuckooRep does not support iteration and snapshots,
   as our current main purpose of this is to optimize point access.

= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
  supports snapshots.  If it returns false, then DB::GetSnapshot() will
  always return nullptr.

Test Plan:
Run existing tests.  Will develop a test specifically for cuckoo hash in
the next diff.

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D16155
2014-04-29 17:13:46 -07:00
Igor Canadi
e525bb16ea Make kMajorVersion and kMinorVersion take version from version macros 2014-04-29 11:59:48 -04:00
Igor Canadi
6cb0cb300c Add version.h 2014-04-29 11:33:00 -04:00
Igor Canadi
f868dcbbed Support for adding TTL-ed column family
Summary:
This enables user to add a TTL column family to normal DB.

Next step should be to expand StackableDB and create StackableColumnFamily, such that users can for example add geo-spatial column families to normal DB.

Test Plan: added a test

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18201
2014-04-28 20:34:20 -07:00
Donovan Hide
4f9fae9bb7 Add rocksdb_open_for_read_only to C API 2014-04-27 20:57:10 +01:00
Igor Canadi
a618691a3b Read-only BackupEngine
Summary: Read-only BackupEngine can connect to the same backup directory that is already running BackupEngine. That enables some interesting use-cases (i.e. restoring replica from primary's backup directory)

Test Plan: added a unit test

Reviewers: dhruba, haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18297
2014-04-25 15:49:29 -04:00
Lei Jin
3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
sdong
86a0133d05 PlainTableReader to expose index size to users
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.

Test Plan:
Add a unit test.
make all check`

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18195
2014-04-22 19:29:05 -07:00
Igor Canadi
3992aec8fa Support for column families in TTL DB
Summary:
This will enable people using TTL DB to do so with multiple column families. They can also specify different TTLs for each one.

TODO: Implement CreateColumnFamily() in TTL world.

Test Plan: Added a very simple sanity test.

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, alberts

Differential Revision: https://reviews.facebook.net/D17859
2014-04-22 11:27:33 -07:00
Igor Canadi
588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi
23c8f89b57 Revert "Don't compile ldb tool into static library"
This reverts commit e296577ef6.
2014-04-15 11:29:02 -07:00
Igor Canadi
e296577ef6 Don't compile ldb tool into static library
Summary:
This is first step of my effort to reduce size of librocksdb.a for use in mobile.

ldb object files are huge and are ment to be used as a command line tool. I moved them to `tools/` directory and include them only when compiling `ldb`

This diff reduced librocksdb.a from 42MB to 39MB on my mac (not stripped).

Test Plan: ran ldb

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17823
2014-04-15 10:52:39 -07:00
Igor Canadi
be016613c2 Expose in memory Env to the world
Summary: That will help with some iOS testing I'm doing.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, yhchiang, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17787
2014-04-14 12:28:15 -07:00
Igor Canadi
30aff72f77 Don't shadow in ColumnFamilyDescriptor 2014-04-11 14:48:20 -07:00
Lei Jin
eba3fc644a make corruption_test:CompactionInputErrorParanoid deterministic
Summary:
it writes ~10M data, default L0 compaction trigger is 4, plus 2 writer
buffer, so that can accommodate ~6M data before compaction happens for
sure. I guess encoding is doing a good job to shrink the data so that
sometime, compaction does not get triggered. I get test failure quite
often.

Test Plan: ran it multiple times and all got pass

Reviewers: igor, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17775
2014-04-11 12:48:38 -07:00
Igor Canadi
b3d7435b4e No shadow in public headers 2014-04-10 17:19:03 -07:00
Igor Canadi
ddef6841b3 Renamed InfoLogLevel::DEBUG to InfoLogLevel::DEBUG_LEVEL
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.

Test Plan: compiles

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17709
2014-04-10 15:27:42 -07:00
Kai Liu
75b59d5146 Enable hash index for block-based table
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.

Test Plan: Wrote several new unit tests.

Reviewers: sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16539
2014-04-10 14:19:43 -07:00
Igor Canadi
4daea66343 Turn on -Wmissing-prototypes
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17649
2014-04-09 21:17:14 -07:00
Igor Canadi
b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Lei Jin
92c1eb0291 macros for perf_context
Summary: This will allow us to disable them completely for iOS or for better performance

Test Plan: will run make all check

Reviewers: igor, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17511
2014-04-08 10:58:07 -07:00
Igor Canadi
664559fe2d Small final fixes before merge 2014-04-07 15:38:53 -07:00
Igor Canadi
3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
Igor Canadi
bcd1f15b60 Remove -Wno-unused-const-variable 2014-04-04 16:15:47 -07:00
Lei Jin
c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
Igor Canadi
51023c3911 Make RocksDB compile for iOS
Summary:
I had to make number of changes to the code and Makefile:
* Add `make lib`, that will create static library without debug info. We need this to avoid growing binary too much. Currently it's 14MB.
* Remove cpuinfo() function and use __SSE4_2__ macro. We actually used the macro as part of Fast_CRC32() function.
As a result, I also accidentally fixed this issue: https://www.facebook.com/groups/rocksdb.dev/permalink/549700778461774/?stream_ref=2
* Remove __thread locals in OS_MACOSX

Test Plan: `make lib PLATFORM=IOS`

Reviewers: ljin, haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17475
2014-04-04 13:11:44 -07:00
Thomas Adam
98422cba77 [C-API] implemented more options 2014-04-03 10:47:37 +02:00
Thomas Adam
3a30b5b0be [C-API] added "rocksdb_options_set_plain_table_factory" to make it possible to use plain table factory 2014-04-03 10:47:37 +02:00
sdong
4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
Igor Canadi
8555ce2dec Merge branch 'master' into columnfamilies 2014-04-02 10:48:05 -07:00
Thomas Adam
38dc5ef45f [C-API] added the possiblity to create a HashSkipList or HashLinkedList to support prefix seeks 2014-04-01 12:44:27 +02:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
sdong
43a593a6d9 Change default value of some Options
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.

Test Plan: make all check

Reviewers: dhruba, haobo, ljin, igor, yhchiang

Reviewed By: igor

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D16995
2014-03-28 17:09:28 -07:00
Dhruba Borthakur
4031b98373 A GIS implementation for rocksdb.
Summary:
    This patch stores gps locations in rocksdb.

    Each object is uniquely identified by an id. Each object has
    a gps (latitude, longitude) associated with it. The geodb
    supports looking up an object either by its gps location
    or by its id. There is a method to retrieve all objects
    within a circular radius centered at a specified gps location.

Test Plan: Simple unit-test attached.

Reviewers: leveldb, haobo

Reviewed By: haobo

CC: leveldb, tecbot, haobo

Differential Revision: https://reviews.facebook.net/D15567
2014-03-28 15:26:44 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Igor Canadi
1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
Igor Canadi
e8168382c4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-25 11:09:40 -07:00
Danny Guo
b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
b253f24403 Rate limiter for BackupableDB
Summary: Might be useful if client doesn't want to effect running system during backup too much.

Test Plan: added a test case

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17091
2014-03-24 11:38:44 -07:00
Igor Canadi
e67241f0b9 Sanity check on Open
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Yiting Li
7981a43274 Consistency Check Function
Summary: Added a function/command to check the consistency of live files' meta data

Test Plan:
Manual test (size mismatch, file not exist).
Command test script.

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D16935
2014-03-20 13:42:45 -07:00
Igor Canadi
3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi
64904b39a0 Merge branch 'master' into columnfamilies
Conflicts:
	utilities/backupable/backupable_db.cc
2014-03-17 17:57:14 -07:00
Igor Canadi
9caeff516e keep_log_files option in BackupableDB
Summary:
Added an option to BackupableDB implementation that allows users to persist in-memory databases. When the restore happens with keep_log_files = true, it will
*) Not delete existing log files in wal_dir
*) Move log files from archive directory to wal_dir, so that DB can replay them if necessary

Test Plan: Added an unit test

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16941
2014-03-17 15:39:23 -07:00
Igor Canadi
db234133a9 [CF] WriteBatch to take in ColumnFamilyHandle
Summary: Client doesn't need to know anything about ColumnFamily ID. By making WriteBatch take ColumnFamilyHandle as a parameter, we can eliminate method GetID() from ColumnFamilyHandle

Test Plan: column_family_test

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16887
2014-03-14 11:30:14 -07:00
Igor Canadi
dff9214165 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	tools/db_stress.cc
2014-03-12 10:17:41 -07:00
Igor Canadi
2b95dc1542 Revert "Fix bad merge of D16791 and D16767"
This reverts commit 839c8ecfcd.
2014-03-12 09:37:43 -07:00
sdong
839c8ecfcd Fix bad merge of D16791 and D16767
Summary: A bad Auto-Merge caused log buffer is flushed twice. Remove the unintended one.

Test Plan: Should already be tested (the code looks the same as when I ran unit tests).

Reviewers: haobo, igor

Reviewed By: haobo

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16821
2014-03-11 21:31:57 -07:00
Igor Canadi
457c78eb89 [CF] db_stress for column families
Summary:
I had this diff for a while to test column families implementation. Last night, I ran it sucessfully for 10 hours with the command:

   time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1  --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress

It is ready to be committed :)

Test Plan: Ran it for 10 hours

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16797
2014-03-11 12:06:12 -07:00
sdong
01dcef114b Env to add a function to allow users to query waiting queue length
Summary: Add a function to Env so that users can query the waiting queue length of each thread pool

Test Plan: add a test in env_test

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, igor, yhchiang, ljin, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D16755
2014-03-11 10:19:02 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Igor Canadi
9db8c4c556 Fix share_table_files bug
Summary: constructor wasn't properly constructing BackupableDBOptions

Test Plan: no test

Reviewers: benj

Reviewed By: benj

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16749
2014-03-10 14:42:03 -07:00
Lei Jin
8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Haobo Xu
9e0e6aa7f6 [RocksDB] make sure KSVObsolete does not get accessed as a valid pointer.
Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.

Test Plan: make check

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16701
2014-03-10 12:55:25 -07:00
Igor Canadi
b04c75d244 Dump options in backupable DB
Summary: We should dump options in backupable DB log, just like with to with RocksDB. This will aid debugging.

Test Plan: checked the log

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16719
2014-03-10 12:09:54 -07:00
Haobo Xu
66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
Igor Canadi
04d2c26e17 Add option verify_checksums_in_compaction
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695
2014-03-10 10:06:34 -07:00
Igor Canadi
1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Igor Canadi
9f15092ebd [CF] NewIterators
Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation

Test Plan: added a unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16689
2014-03-07 16:15:25 -08:00
Lei Jin
e5fa4944fc use CAS when returning SuperVersion to ThreadLocal
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641
2014-03-07 14:43:22 -08:00
sdong
e1f52b6a22 Fix Valgrind error introduced by D16515
Summary: valgrind reports issues. This patch seems to fix it.

Test Plan: run the tests that fails in valgrind

Reviewers: igor, haobo, kailiu

Reviewed By: kailiu

CC: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16653
2014-03-06 16:07:37 -08:00
Igor Canadi
80a207fc90 Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/compaction_picker.h
	db/db_impl.cc
	db/version_set.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-05 16:59:22 -08:00
sdong
ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
sdong
4405f3a000 Allow user to specify log level for info_log
Summary:
Currently, there is no easy way for user to change log level of info log. Add a parameter in options to specify that.
Also make the default level to INFO level. Removing the [INFO] tag if it is INFO level as I don't want to cause performance regression. (add [LOG] means another mem-copy and string formatting).

Test Plan:
make all check
manual check the levels work as expected.

Reviewers: dhruba, yhchiang

Reviewed By: yhchiang

CC: dhruba, igor, i.am.jin.lei, ljin, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D16563
2014-03-05 14:54:31 -08:00
Igor Canadi
0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
Lei Jin
04298f8c33 output perf_context in db_bench readrandom
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575
2014-03-05 10:32:54 -08:00
Igor Canadi
e3f396f1ea Some fixes to BackupableDB
Summary:
(1) Report corruption if backup meta file has tailing data that was not read. This should fix: https://github.com/facebook/rocksdb/issues/81 (also, @sdong reported similar issue)
(2) Don't use OS buffer when copying file to backup directory. We don't need the file in cache since we won't be reading it twice
(3) Don't delete newer backups when somebody tries to backup the diverged DB (restore from older backup, add new data, try to backup). Rather, just fail the new backup.

Test Plan: backupable_db_test

Reviewers: ljin, dhruba, sdong

Reviewed By: ljin

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D16287
2014-03-04 17:02:25 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
kailiu
74939a9e13 Make the block-based table's index pluggable
Summary:
This patch introduced a new table options that allows us to make
block-based table's index pluggable.

To support that new features:

* Code has been refacotred to be more flexible and supports this option well.
* More documentation is added for the existing obsecure functionalities.
* Big surgeon on DataBlockReader(), where the logic was really convoluted.
* Other small code cleanups.

The pluggablility will mostly affect development of internal modules
and won't change frequently, as a result I intentionally avoid
heavy-weight patterns (like factory) and try to make it simple.

Test Plan: make all check

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16395
2014-02-28 18:19:07 -08:00
kailiu
bf86af5174 Remove the terrible hack in for flush_block_policy_factory
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503
2014-02-28 16:39:27 -08:00
Igor Canadi
58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Yueh-Hsuan Chiang
a77527f2af Add ReadOptions to TransactionLogIterator.
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.

Test Plan: Tests are done off-line and will not be included in the regular unit test.

Reviewers: igor

Reviewed By: igor

CC: leveldb, xjin, dhruba

Differential Revision: https://reviews.facebook.net/D16305
2014-02-28 11:50:36 -08:00
Lei Jin
ad0c3747cb cache SuperVersion in thread local storage to avoid mutex lock
Summary: as title

Test Plan:
asan_check
will post results later

Reviewers: haobo, igor, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16257
2014-02-27 11:38:55 -08:00
Yueh-Hsuan Chiang
ccaedd16d4 Enable log info with different levels.
Summary:
* Now each Log related function has a variant that takes an additional
  argument indicating its log level, which is one of the following:
 - DEBUG, INFO, WARN, ERROR, FATAL.

* To ensure backward-compatibility, old version Log functions are kept
  unchanged.

* Logger now has a member variable indicating its log level.  Any incoming
  Log request which log level is lower than Logger's log level will not
  be output.

* The output of the newer version Log will be prefixed by its log level.

Test Plan:
Add a LogType test in auto_roll_logger_test.cc

 = Sample log output =
    2014/02/11-00:03:07.683895 7feded179840 [DEBUG] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683898 7feded179840 [INFO] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683900 7feded179840 [WARN] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683903 7feded179840 [ERROR] this is the message to be written to the log file!!
    2014/02/11-00:03:07.683906 7feded179840 [FATAL] this is the message to be written to the log file!!

Reviewers: dhruba, xjin, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16071
2014-02-26 14:41:28 -08:00
Igor Canadi
8b7ab9951c [CF] Handle failure in WriteBatch::Handler
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.

Test Plan: Added test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16323
2014-02-26 10:10:00 -08:00
Igor Canadi
944ff673d6 Merge branch 'master' into columnfamilies 2014-02-26 10:09:52 -08:00
Lei Jin
b2795b799e thread local pointer storage
Summary:
This is not a generic thread local implementation in the sense that it
only takes pointer. But it does support multiple instances per thread
and lets user plugin function to perform cleanup when thread exits or an
instance gets destroyed.

Test Plan: unit test for now

Reviewers: haobo, igor, sdong, dhruba

Reviewed By: igor

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D16131
2014-02-25 17:47:37 -08:00
Igor Canadi
8895526308 Merge branch 'master' into columnfamilies 2014-02-25 17:04:48 -08:00
Igor Canadi
dc277f0ab7 [CF] Adaptation of GetLiveFiles for CF
Summary: Even if user flushes the memtables before getting live files, we still can't guarantee that new data didn't come in (to already-flushed memtables). If we want backups to provide consistent view of the database, we still need to get WAL files.

Test Plan: backupable_db_test

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16299
2014-02-25 13:21:14 -08:00
Albert Strasheim
72aacf6b96 A few more C API functions. 2014-02-25 10:32:28 -08:00
Igor Canadi
d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00
Igor Canadi
2bf1151a25 Fix C API 2014-02-24 15:15:34 -08:00
Igor Canadi
18a7cdfba0 Merge pull request #82 from tecbot/api-enhancements
Enhancements to the API
2014-02-24 14:20:13 -08:00
Thomas Adam
68248a2ac5 added a delete method for custom filter policy and merge operator to make it possible to override the cleanup behaviour of the return value 2014-02-23 17:58:11 +01:00
Thomas Adam
d74c9b79ea Enhancements to the API 2014-02-19 23:59:54 +01:00
Igor Canadi
44a9cbda17 Make GetPropertiesOfAllTables not virtual 2014-02-18 11:22:16 -08:00
Igor Canadi
422bb09cb0 Fix table properties
Summary: Adapt table properties to column family world

Test Plan: make check

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16161
2014-02-14 17:13:10 -08:00
Igor Canadi
76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
kailiu
63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Siying Dong
f3ae3d07cc Add more black-box tests for PlainTable and explicitly support total order mode
Summary:
1. Add some more implementation-aware tests for PlainTable
2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0  means binary search only. Also fixes some divide 0 risks.
3. Explicitly support total order (only use binary search)
4. some code cleaning up.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16023
2014-02-12 17:37:22 -08:00
Igor Canadi
39ae9f7988 Remove constructors for ColumnFamilyHandle 2014-02-12 14:33:19 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
kailiu
e6b3e3b4db Support prefix seek in UserCollectedProperties
Summary: We'll need the prefix seek support for property aggregation.

Test Plan: make all check

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15963
2014-02-12 13:14:59 -08:00
Igor Canadi
ca5f1a225a CompactionContext to include is_manual_compaction
Summary: Added a bit more information to compaction context, requested by internal team at FB.

Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16095
2014-02-12 12:24:18 -08:00
Lei Jin
994c327b86 IOError cleanup
Summary: Clean up IOErrors so that it only indicates errors talking to device.

Test Plan: make all check

Reviewers: igor, haobo, dhruba, emayanke

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15831
2014-02-12 11:42:54 -08:00
Siying Dong
33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Igor Canadi
8e634d3ea4 Merge pull request #74 from alberts/lz4
Support for LZ4 compression.
2014-02-10 15:46:56 -08:00
Igor Canadi
bc2ff597b8 Fixed wrong comment GetTableMetaData -> GetLiveFilesMetaData 2014-02-10 10:55:10 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
Igor Canadi
8d4db63a2d [CF] OpenWithColumnFamilies -> Open
Summary: By discussion with @dhruba, overloading Open makes more sense

Test Plan: compiles!

Reviewers: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16017
2014-02-07 14:49:33 -08:00
Dhruba Borthakur
0982c38020 Fix compilation error with gcc 4.7
Summary:
Fix compilation error with gcc 4.7

Test Plan:
make clean
make

Reviewers:

CC:

Task ID: #

Blame Rev:
2014-02-07 13:52:54 -08:00
Igor Canadi
99e61fdd5c [CF] Separate dumping of DBOptions and ColumnFamilyOptions
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.

Test Plan: make check, confirm contents of the LOG

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16011
2014-02-07 13:52:51 -08:00
Igor Canadi
2b8c44639a Merge branch 'master' into columnfamilies 2014-02-07 11:42:56 -08:00
Yueh-Hsuan Chiang
3ce8d9a988 Add support for plain table format to sst_dump.
Summary:
This diff enables the command line tool `sst_dump` to work for sst files
under plain table format.  Changes include:
  * In tools/sst_dump.cc:
    - add support for plain table format
    - display prefix_extractor information when --show_properties is on
  * In table/format.cc
    - Now the table magic number of a Footer can be later initialized
      via ReadFooterFromFile().
  * In table/meta_bocks:
    - add function ReadTableMagicNumber() that reads the magic number of
      the specified file.

Minor fixes:
 - remove a duplicate #include in table/table_test.cc
 - fix a commentary typo in include/rocksdb/memtablerep.h
 - fix lint errors.

Test Plan:
Runs sst_dump with both block-based and plain-table format files with
different arguments, specifically those with --show-properties and --from.

* sample output:
  https://reviews.facebook.net/P261

Reviewers: kailiu, sdong, xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15903
2014-02-07 11:15:00 -08:00
Igor Canadi
0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi
0b4ccf765c Flushes should always go to HIGH priority thread pool
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).

It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15987
2014-02-06 14:17:13 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Igor Canadi
f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi
328ac7ee02 Merge branch 'master' into columnfamilies 2014-02-05 12:00:33 -08:00
Igor Canadi
73f62255c1 [CF] Split SanitizeOptions into two
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)

I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.

Test Plan: make check doesn't complain

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15873
2014-02-04 17:26:51 -08:00
kailiu
d43ebd8c65 Put table factory back to public api
Summary:
Previous I am too ambitious to hide every detail about table factory
to internal api. However, we cannot pass the compilatoin for external
users since we use table factory as the shared_ptr, which requires
the definition of table factory's destructor.

Test Plan: make check;

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15861
2014-02-03 19:51:20 -08:00
Igor Canadi
2966d764cd Fix some 32-bit compile errors
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697

Test Plan: RocksDB still compiles on 64-bit :)

Reviewers: kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15825
2014-02-03 13:48:30 -08:00
Siying Dong
d169b67680 [Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte.
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489
2014-02-03 12:19:30 -08:00
kailiu
4f6cb17bdb First phase API clean up
Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525
2014-02-03 00:30:43 -08:00
Igor Canadi
f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
Igor Canadi
5693db2a02 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-01-31 14:45:15 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Dhruba Borthakur
abd70ecc2b The default settings enable checksum verification on every read.
Summary: The default settings enable checksum verification on every read.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15591
2014-01-30 19:14:03 -08:00
Igor Canadi
ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
kailiu
3170abd297 Remove unused classes
Summary: This is a followup diff for https://reviews.facebook.net/D15447, which picks the most simple task: delete some unused memtable reps.

Test Plan: make

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15585
2014-01-29 16:40:36 -08:00
Igor Canadi
514e42c7cc Fix some lint warnings 2014-01-29 15:27:27 -08:00
Igor Canadi
c1071ed95c Merge branch 'master' into columnfamilies 2014-01-28 16:04:00 -08:00
Igor Canadi
e5ec7384a0 Better interface to create BackupEngine
Summary: I think it looks nicer. In RocksDB we have both styles, but I think that static method is the more common version.

Test Plan: backupable_db_test

Reviewers: ljin, benj, swk

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15519
2014-01-28 16:01:53 -08:00
Igor Canadi
ec2fa4a690 Export BackupEngine
Summary:
Lots of clients have problems with using StackableDB interface. It's nice to have BackupableDB as a layer on top of DB, but not necessary.

This diff exports BackupEngine, which can be used to create backups without forcing clients to use StackableDB interface.

Test Plan: backupable_db_test

Reviewers: dhruba, ljin, swk

Reviewed By: ljin

CC: leveldb, benj

Differential Revision: https://reviews.facebook.net/D15477
2014-01-28 11:26:07 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi
ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Siying Dong
b20486f294 [Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15393
2014-01-27 10:26:14 -08:00
Igor Canadi
5356b2a680 Merge branch 'master' into columnfamilies 2014-01-24 18:34:48 -08:00
Siying Dong
8477255da3 Moving Some includes from options.h to forward declaration
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.

Test Plan: make all check

Reviewers: kailiu, haobo, igor, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15411
2014-01-24 17:16:22 -08:00
Igor Canadi
1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi
b13bdfa500 Add a call DisownData() to Cache, which should speed up shutdown
Summary: On a shutdown, freeing memory takes a long time. If we're shutting down, we don't really care about memory leaks. I added a call to Cache that will avoid freeing all objects in cache.

Test Plan:
I created a script to test the speedup and demonstrate how to use the call: https://phabricator.fb.com/P3864368

Clean shutdown took 7.2 seconds, while fast and dirty one took 6.3 seconds. Unfortunately, the speedup is not that big, but should be bigger with bigger block_cache. I have set up the capacity to 80GB, but the script filled up only ~7GB.

Reviewers: dhruba, haobo, MarkCallaghan, xjin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15069
2014-01-24 14:57:52 -08:00
Igor Canadi
28d1a0c6f5 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/db_impl_readonly.h
	db/db_test.cc
	include/rocksdb/db.h
	include/utilities/stackable_db.h
2014-01-24 09:27:29 -08:00
Lei Jin
aba2acb5ec CompactRange() to return status
Summary: as title

Test Plan:
make all check
What else tests shall I cover?

Reviewers: igor, haobo

CC:

Differential Revision: https://reviews.facebook.net/D15339
2014-01-23 16:41:46 -08:00
Kai Liu
054c5dda8c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
	util/statistics_imp.h
2014-01-23 16:32:49 -08:00
Tomislav Novak
81c9cc9b3b Tailing iterator
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.

TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).

Test Plan:
* new unit tests
* running LD with this patch

Reviewers: igor, dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb, lovro, march

Differential Revision: https://reviews.facebook.net/D15285
2014-01-23 16:26:08 -08:00
Kai Liu
bb19b530ca Aggressively inlining the short functions in coding.cc
Summary:
This diff takes an even more aggressive way to inline the functions. A decent rule that I followed is "not inline a function if it is more than 10 lines long."

Normally optimizing code by inline is ugly and hard to control, but since one of our usecase has significant amount of CPU used in functions from coding.cc, I'd like to try this diff out.

Test Plan:
1. the size for some .o file increased a little bit, but most less than 1%. So I think the negative impact of inline is negligible.
2. As the regression test shows (ran for 10 times and I calculated the average number)

    Metrics                                         Befor    After
    ========================================================================
    rocksdb.build.fillseq.qps                       426595   444515    (+4.6%)
    rocksdb.build.memtablefillrandom.qps            121739   123110
    rocksdb.build.memtablereadrandom.qps            1285103  1280520
    rocksdb.build.overwrite.qps                     125816   135570    (+9%)
    rocksdb.build.readrandom_fillunique_random.qps  285995   296863
    rocksdb.build.readrandom_memtable_sst.qps       1027132  1027279
    rocksdb.build.readrandom.qps                    1041427  1054665
    rocksdb.build.readrandom_smallblockcache.qps    1028631  1038433
    rocksdb.build.readwhilewriting.qps              918352   914629

Reviewers: haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15291
2014-01-23 16:03:34 -08:00
Igor Canadi
7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi
23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Igor Canadi
83681bf9ef Statistics code cleanup
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba, tnovak

Reviewed By: tnovak

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15099
2014-01-17 12:46:06 -08:00
Naman Gupta
1447bb5919 Allow callback to change size of existing value. Change return type of the callback function to an enum status to handle 3 cases.
Summary:
This diff fixes 2 hacks:
* The callback function can modify the existing value inplace, if the merged value fits within the existing buffer size. But currently the existing buffer size is not being modified. Now the callback recieves a int* allowing the size to be modified. Since size is encoded as a varint in the internal key for memtable. It might happen that the entire value might have be copied to the new location if the new size varint is smaller than the existing size varint.
* The callback function has 3 functionalities
    1. Modify existing buffer inplace, and update size correspondingly. Now to indicate that, Returns 1.
    2. Generate a new buffer indicating merged value. Returns 2.
    3. Fails to do either of above, based on whatever application logic. Returns 0.

Test Plan: Just make all for now. I'm adding another unit test to test each scenario.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb, sdong, kailiu, xinyaohu, sumeet, danguo

Differential Revision: https://reviews.facebook.net/D15195
2014-01-16 15:12:39 -08:00
kailiu
1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
kailiu
eae1804f29 Remove the unnecessary use of shared_ptr
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213
2014-01-15 18:22:01 -08:00
kailiu
c8f16221ed Fix the return type of WriteBatch::Data().
Summary: Quick fix for https://reviews.facebook.net/D15123

Test Plan: Make check

Reviewers: sdong, vkrest

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15165
2014-01-14 20:24:48 -08:00
Igor Canadi
d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Siying Dong
9ea8bf90f1 DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 11:24:43 -08:00
Siying Dong
51dd21926c DB::Put() to estimate write batch data size needed and pre-allocate buffer
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.

Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.

Test Plan: make all check

Reviewers: haobo, kailiu, igor

Reviewed By: haobo

CC: leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D15135
2014-01-14 10:53:16 -08:00
Naman Gupta
8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
Siying Dong
c4548d5f1f WriteBatch to provide a way for user to query data size directly and only return constant reference of data in Data()
Summary:
WriteBatch::Data() now is easily to be misuse by users. Also, there is no cheap way for user of WriteBatch to know the data size accumulated. This patch fix the problem by:
(1) return a constant reference to Data() so it's obvious to caller what it means.
(2) add a function to return data size directly

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: kailiu

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D15123
2014-01-13 16:52:14 -08:00
Igor Canadi
151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Schalk-Willem Kruger
a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Siying Dong
aa0ef6602d [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039
2014-01-10 15:57:49 -08:00
Siying Dong
424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00
Igor Canadi
cb37ddf229 Feature requests for BackupableDB
Summary:
This diff introduces some features that were requested by two internal customers:
* Ability for backups not to share table files, because we can't guarantee that equal filename means equal content accross replicas
* Ability for two threads to call EnableFileDeletions() and DisableFileDeletions()
* Ability to stop backup from another thread and not slow down the DB close
* Copy the files to the temporary folder first and then atomically rename

Test Plan: Added some tests to backupable_db_test

Reviewers: dhruba, sanketh, muthu, sdong, haobo

Reviewed By: haobo

CC: leveldb, sanketh, muthu

Differential Revision: https://reviews.facebook.net/D14769
2014-01-09 12:24:28 -08:00
kailiu
12b6d2b839 Separate the aligned and unaligned memory allocation
Summary: Use two vectors for different types of memory allocation.

Test Plan: run all unit tests.

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15027
2014-01-08 15:11:42 -08:00
Igor Canadi
19e3ee64ac Add column family information to WAL
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).

This endeavour required changing some WriteBatch internals.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15045
2014-01-08 12:53:33 -08:00
Igor Canadi
a45b7d83ba Merge pull request #59 from mlin/more-c-bindings
C API: add rocksdb_env_set_high_priority_background_threads
2014-01-07 16:33:03 -08:00
Igor Canadi
72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi
ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
kailiu
e72aa37cc5 Merge branch 'master' into performance
Conflicts:
	db/table_cache.cc
2014-01-02 16:34:59 -08:00
Igor Canadi
7535443083 [RocksDB] Support for column families in manifest
Summary:
<This diff is for Column Family branch>

Added fields in manifest file to support adding and deleting column families.

Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14733
2014-01-02 04:18:28 -08:00
Igor Canadi
6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
Igor Canadi
b60c14f6ee Support multi-threaded DisableFileDeletions() and EnableFileDeletions()
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().

However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.

Test Plan: make check

Reviewers: dhruba, haobo, sanketh

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14781
2014-01-02 03:33:42 -08:00
Mike Lin
4b1d049236 C API: add rocksdb_env_set_high_priority_background_threads 2013-12-31 15:14:18 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Siying Dong
18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
Siying Dong
abaf26266d [RocksDB] [Performance Branch] Some Changes to PlainTable format
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test

Test Plan: test db_test

Reviewers: haobo, kailiu

CC: dhruba, igor, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D14457
2013-12-20 12:08:35 -08:00
Igor Canadi
b26dc95628 Initialize sequence number in BatchResult - issue #39 2013-12-20 10:01:12 -08:00
Igor Canadi
269709a885 Merge branch 'master' into columnfamilies 2013-12-18 15:56:37 -08:00
Igor Canadi
9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
Mike Lin
2a2506b629 C bindings: add a bunch of the newer options 2013-12-15 13:47:06 -08:00
Kai Liu
2e9efcd6d8 Add the property block for the plain table
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format

  [data block]
  [meta block 1: stats block]
  [meta block 2: future extended block]
  ...
  [meta block K: future extended block]  (we may add more meta blocks in the future)
  [metaindex block]
  [index block: we only have the placeholder here, we can add persistent index block in the future]
  [Footer: contains magic number, handle to metaindex block and index block]
  <end_of_file>

Test Plan: extended existing property block test.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14523
2013-12-13 17:18:14 -08:00
kailiu
b660e2d468 Expose usage info for the cache
Summary: This diff will help us to figure out the memory usage for the cache part.

Test Plan: added a new memory usage test for cache

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14559
2013-12-13 12:53:45 -08:00
Mark Callaghan
e9e6b00d29 Add monitoring for universal compaction and add counters for compaction IO
Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583
2013-12-12 13:27:43 -08:00
Siying Dong
a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
Haobo Xu
3c02c363b3 [RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering
Summary: as title

Test Plan: dynamic_bloom_test

Reviewers: dhruba, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14385
2013-12-11 00:15:14 -08:00
Igor Canadi
5e4ab767cf BackupableDB delete backups with newer seq number
Summary: We now delete backups with newer sequence number, so the clients don't have to handle confusing situations when they restore from backup.

Test Plan: added a unit test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14547
2013-12-10 20:49:28 -08:00
kailiu
c79e595471 Make Cache::GetCapacity constant
Summary: This will allow us to access constant via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()` since GetOptions() is also constant method.
2013-12-10 17:34:35 -08:00
Igor Canadi
a204dabb9d Merge pull request #31 from sepeth/c-api
Rename leveldb to rocksdb in C api
2013-12-10 09:18:47 -08:00
Doğan Çeçen
6c4e110c8c Rename leveldb to rocksdb in C api 2013-12-10 10:48:35 +02:00
Igor Canadi
fb9fce4fc3 [RocksDB] BackupableDB
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

   backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
               0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
               files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
               files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediatelly when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295
2013-12-09 14:06:52 -08:00
kailiu
c7707f24c2 Refine the statistics 2013-12-06 16:51:35 -08:00
kailiu
551e9428ce Merge branch 'master' into performance 2013-12-06 14:15:42 -08:00
Siying Dong
ef2211a9ca [RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415
2013-12-06 10:28:59 -08:00
kailiu
90729f8b23 Extract metaindex block from block-based table
Summary: This change will allow other table to reuse the code for meta blocks.

Test Plan: all existing unit tests passed

Reviewers: dhruba, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14475
2013-12-05 16:34:16 -08:00
Mayank Agarwal
92e8316118 Make GetDbIdentity pure virtual and also implement it for StackableDB, DBWithTTL
Summary: As title

Test Plan: make clean and make

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14469
2013-12-05 12:02:31 -08:00
Mayank Agarwal
18802689b8 Make an API to get database identity from the IDENTITY file
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)

Test Plan: db/db_test (has identity checks)

Reviewers: dhruba, haobo, igor, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14463
2013-12-04 22:39:17 -08:00
Siying Dong
f040e536e4 [RocksDB Performance Branch] A more customized index in PlainTableReader
Summary:
PlainTableReader to use a more customized hash table. This patch assumes the SST file is smaller than 2GB:
(1) Every bucket uses 32-bit integer
(2) no key is stored in bucket
(3) use the first bit of the bucket value to distinguish it points to the file offset or a second level index.
This index schema fits the use case that most of prefixes have very small number of keys

Test Plan: plain_table_db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14343
2013-12-04 13:43:45 -08:00
Sajal Jain
28a1b9b95f [rocksdb] statistics counters for memtable hits and misses
Summary:
added counters
rocksdb.memtable.hit - for memtable hit
rocksdb.memtable.miss - for memtable miss

Test Plan: db_bench tests

Reviewers: igor, dhruba, haobo

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D14433
2013-12-03 12:59:53 -08:00
Igor Canadi
eb12e47e0e Killing Transform Rep
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.

This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.

I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14397
2013-12-03 12:42:15 -08:00
lovro
930cb0b9ee Clarify CompactionFilter thread safety requirements
Summary: Documenting our discussion

Test Plan: make

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: igor

Differential Revision: https://reviews.facebook.net/D14403
2013-12-02 16:41:43 -08:00
Dhruba Borthakur
38feca4f35 Removed redundant slice_transform.h and memtablerep.h
Summary:
Removed redundant slice_transform.h and memtablerep.h

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-11-29 18:03:02 -08:00
Kai Liu
1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
Haobo Xu
4e6463ea44 [RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.

Test Plan: make check and test it on leaf4

Reviewers: dhruba, sdong, kailiu

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14307
2013-11-26 21:59:36 -08:00
Igor Canadi
3ce3658411 DB::GetOptions()
Summary: We need access to options for BackupableDB

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14331
2013-11-25 15:51:50 -08:00
Igor Canadi
11c26bd4a4 [RocksDB] Interface changes required for BackupableDB
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review

Test Plan: make asan_check

Reviewers: dhruba, haobo, emayanke

Reviewed By: emayanke

CC: leveldb, kailiu, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14301
2013-11-25 12:39:23 -08:00
Haobo Xu
5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
Siying Dong
dfa1460d88 [For Performance Branch] Bloom filter in PlainTableIterator::Seek() - Update 1
Summary:
Address @haobo's comments in D14277

Test Plan: ./indexed_table_db_test

Reviewers: haobo

CC:

Task ID: #

Blame Rev:
2013-11-21 23:33:45 -08:00
Siying Dong
718488abc5 Add BloomFilter to PlainTableIterator::Seek()
Summary:
This patch adds a simple bloom filter in PlainTableIterator::Seek()

Test Plan: N/A

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-11-21 22:26:39 -08:00
Siying Dong
3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong
b135d01e7b Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001

Conflicts:
	table/merger.cc
2013-11-21 17:39:19 -08:00
Siying Dong
3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
Siying Dong
58e1956d50 [Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables.
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.

In longer term, we might want to revisit the way we abstract mem tables implementations.

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14265
2013-11-20 20:49:23 -08:00
Siying Dong
b59d4d5a50 A Simple Plain Table
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.

Test Plan:Add unit test

Reviewers:haobo,dhruba,kailiu

CC:

Task ID: #

Blame Rev:
2013-11-20 18:44:22 -08:00
kailiu
6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
kailiu
1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing test. and the sample output of table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
Dhruba Borthakur
31295b0a1b Add License message to public header files.
Summary:
Add License message to public header files.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-11-18 10:21:35 -08:00
kailiu
97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff invoves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Igor Canadi
a0ce3fd00a PurgeObsoleteFiles() unittest
Summary:
Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly.

Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions.

Test Plan: Created a unit test

Reviewers: dhruba, haobo, nkg-

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14079
2013-11-14 18:03:57 -08:00
Kai Liu
88ba331c1a Add the index/filter block cache
Summary: This diff leverage the existing block cache and extend it to cache index/filter block.

Test Plan:
Added new tests in db_test and table_test

The correctness is checked by:

1. make check
2. make valgrind_check

Performance is test by:

1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance).
2. db_stress.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, haobo, xjin

Differential Revision: https://reviews.facebook.net/D13167
2013-11-12 22:46:51 -08:00
kailiu
21587760b9 Fixing the warning messages captured under mac os # Consider using git commit -m 'One line title' && arc diff. # You will save time by running lint and unit in the background.
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..

Test Plan: ran make in mac os

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14049
2013-11-12 20:05:28 -08:00
Igor Canadi
9bc4a26f56 Small changes in Deleting obsolete files
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827

Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.

I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.

Test Plan: Ran deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14025
2013-11-12 11:53:26 -08:00
Igor Canadi
65e45f0c4a Update documentation
Summary:
Changed leveldb documentation with rocksdb in doc/index.html. Added some of the important options from options.h to doc.

Also removed benchmark files and impl.h, since this is all replaced by RocksDB wikis.

Test Plan: -

Reviewers: dhruba, haobo, kailiu, emayanke, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13977
2013-11-11 21:02:38 -08:00
Kai Liu
e7c4d823c9 Fix two bugs that caused 3rd party release failure
Summary:

* Fix the link to gflags.
* Fix a warning for the uninitialized data member.
2013-11-10 15:36:30 -08:00
lovro
8a46ecd357 WriteBatch::Put() overload that gathers key and value from arrays of slices
Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload.  It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value.  Since RocksDB makes a copy internally, it's easy to support gather writes.

Test Plan: write_batch_test, new test case

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13947
2013-11-08 16:34:32 -08:00
Igor Canadi
1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Igor Canadi
8b3379dc0a Implementing DynamicIterator for TransformRepNoLock
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.

Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D13953
2013-11-08 00:31:09 -08:00
Kai Liu
fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Igor Canadi
444cf88a56 Flush the log outside of lock
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.

We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.

Test Plan: db_test

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13935
2013-11-07 11:31:56 -08:00
Haobo Xu
fd2044883a [RocksDB] Generalize prefix-aware iterator to be used for more than one Seek
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.

Test Plan: test it on Leaf

Reviewers: dhruba, kailiu, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13929
2013-11-06 20:45:49 -08:00
shamdor
c2be2cba04 WAL log retention policy based on archive size.
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.

Test Plan:
1. Unit tests pass.
2. Benchmark.

Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13869
2013-11-06 18:46:28 -08:00
Igor Canadi
be96f2498e TransformRep - use array instead of unordered_map
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.

Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.

I will run the new benchmark and post the results here.

Test Plan: db_test

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13893
2013-11-06 11:55:43 -08:00
Mayank Agarwal
f837f5b1c9 Making the transaction log iterator more robust
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places

Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13689
2013-11-04 20:49:03 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Haobo Xu
8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Naman Gupta
b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Naman Gupta
fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00