Commit Graph

302 Commits

Author SHA1 Message Date
Chilledheart
54f4e2f188 Fix clang compiler warnings 2014-07-20 22:57:20 +08:00
Stanislau Hlebik
9d70cce047 Adding option to save PlainTable index and bloom filter in SST file.
Summary:
Adding option to save PlainTable index and bloom filter in SST file.
If there is no bloom block and/or index block, PlainTableReader builds
new ones. Otherwise PlainTableReader just use these blocks.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19527
2014-07-18 16:58:13 -07:00
Stanislau Hlebik
92d73cbe78 Add PlainTableOptions
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20175
2014-07-18 00:08:38 -07:00
Igor Canadi
1614284eff Fix compressed cache 2014-07-16 06:45:49 -07:00
Stanislau Hlebik
30c81e7717 Removing NewTotalOrderPlainTableFactory
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19587
2014-07-10 11:32:04 -07:00
Lei Jin
8d9a46fcd1 initialize decoded_internal_key_valid
Summary:
ReadInternalKey() will assign correct value anyway. Initialize it to
true to suppress compiler error reported
https://github.com/facebook/rocksdb/issues/186

Test Plan: I cannot reproduce it but this is obvious

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19467
2014-07-03 23:13:08 -07:00
Feng Zhu
5656367416 use arena to allocate memtable's bloomfilter and hashskiplist's buckets_
Summary:
    Bloomfilter and hashskiplist's buckets_ allocated by memtable's arena
    DynamicBloom: pass arena via constructor, allocate space in SetTotalBits
    HashSkipListRep: allocate space of buckets_ using arena.
       do not delete it in deconstructor because arena would take care of it.
    Several test files are changed.

Test Plan:
    make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D19335
2014-06-30 15:54:31 -07:00
Igor Canadi
9fe87b17aa Fix compile 2014-06-20 10:36:48 +02:00
Igor Canadi
d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Haobo Xu
7a9dd5f214 [RocksDB] Make block based table hash index more adaptive
Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index.

Test Plan: unit test, also manually veried LOG that fallback did occur.

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19191
2014-06-19 16:40:32 -07:00
Lei Jin
1ec2d1c69d fix make shared_lib compilation error
Summary: s/class ParsedInternalKey/struct ParsedInternalKey

Test Plan: make shared_lib

Reviewers: igor, yhchiang, sdong, haobo

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19173
2014-06-19 10:12:26 -07:00
Haobo Xu
167738256f [RocksDB] Fix unit test
Summary: fix a bug in D19047, which caused  DBTest.RecoverDuringMemtableCompaction to fail.

Test Plan: unit test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19155
2014-06-19 01:37:21 -07:00
sdong
edd47c5104 PlainTable to encode to avoid to rewrite prefix when it is the same as the previous key
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h

Test Plan: Add unit test coverage in plain_table_db_test

Reviewers: yhchiang, igor, dhruba, ljin, haobo

Reviewed By: haobo

Subscribers: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18735
2014-06-18 20:41:52 -07:00
Haobo Xu
0f0076ed5a [RocksDB] Reduce memory footprint of the blockbased table hash index.
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.

Test Plan: unit test and shadow testing

Reviewers: dhruba, kailiu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19047
2014-06-18 18:16:07 -07:00
Igor Canadi
3525aac9e5 Change order of parameters in adaptive table factory
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory

I think it makes more sense to have output as the first parameter.

Also, fixed a NewAdaptiveTableFactory() call in unit test

Test Plan: unit test

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19119
2014-06-18 07:04:37 +02:00
sdong
200e4b4a72 Add a table factory that can read DB with both of PlainTable and BlockBasedTable in it
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.

Test Plan: add a unit test

Reviewers: haobo, igor

Reviewed By: igor

Subscribers: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19017
2014-06-17 11:49:22 -07:00
Lei Jin
c83b085770 prefetch bloom filter data block for L0 files
Summary: as title

Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18867
2014-06-12 10:06:18 -07:00
sdong
88a1691a1e BlockBasedTable::PrefixMayMatch() to bloom setting to the beginning of the function
Summary: In BlockBasedTable::PrefixMayMatch() we calculate prefix even if bloom is not config. Move the check before

Test Plan: make all check

Reviewers: igor, ljin

Reviewed By: ljin

Subscribers: wuj, leveldb, haobo, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18993
2014-06-10 11:14:22 -07:00
sdong
80f409ea37 Clean PlainTableReader's variables for better data locality
Summary:
Clean PlainTableReader's data structures:
(1) inline bloom_ (in order to do this, change DynamicBloom to allow lazy initialization)
(2) remove some variables only used when initialization from the class
(3) put variables not used in normal read code paths to the end of the class and reference prefix_extractor directly
(4) make Options a reference.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18891
2014-06-09 13:53:39 -07:00
Igor Canadi
f43c8262c2 Don't compress block bigger than 2GB
Summary: This is a temporary solution to a issue that we have with compression libraries. See task #4453446.

Test Plan: make check doesn't complain :)

Reviewers: haobo, ljin, yhchiang, dhruba, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18975
2014-06-09 12:26:09 -07:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
Lei Jin
388d2054c7 forward iterator
Summary:
Forward iterator puts everything together in a flat structure instead of
a hierarchy of nested iterators. this should simplify the code and
provide better performance. It also enables more optimization since all
information are accessiable in one place.
Init evaluation shows about 6% improvement

Test Plan: db_test and db_bench

Reviewers: dhruba, igor, tnovak, sdong, haobo

Reviewed By: haobo

Subscribers: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D18795
2014-05-30 14:31:55 -07:00
Kai Liu
0b3d03d026 Materialize the hash index
Summary:
Materialize the hash index to avoid the soaring cpu/flash usage
when initializing the database.

Test Plan: existing unit tests passed

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18339
2014-05-15 14:09:03 -07:00
Igor Canadi
26f5dd9a5a TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.

Here's description of task #4296714:
       I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.

       For example, this table has 3155 entries and 1391828 deleted keys.

The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.

Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.

In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.

Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests

Reviewers: sdong, dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18579
2014-05-13 12:30:55 -07:00
Igor Canadi
fec4269966 Fix more gflag namespace issues 2014-05-09 08:41:02 -07:00
sdong
ddd41146c4 MergingIterator uses autovector instead of vector
Summary:
Use autovector in MergingIterator so that if there are 4 or less child iterators in it, iterator wrappers are inline, which is more likely to be cache friendly.

Based on one test run with a shadow traffic of one product, it reduces CPU of MergingIterator::Seek() by half.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18531
2014-05-08 15:01:20 -07:00
Igor Canadi
b5616dafd1 Fix iOS compile 2014-05-07 17:48:31 -07:00
sdong
3a171dcb51 Pass logger to memtable rep and TLB page allocation error logged to info logs
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.

Test Plan: make all check

Reviewers: haobo, igor, yhchiang

Reviewed By: yhchiang

CC: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18471
2014-05-05 16:43:37 -07:00
sdong
9b17558311 PlainTableFactory::PlainTableFactory() to have huge TLB turned off by default
Summary: PlainTableFactory::PlainTableFactory() now has Huge TLB page feature turned on by default. Although it is not a public API (which we always turn the feature off now), our unit tests, like db_test sometimes uses it directly, which causes wrong coverage of codes. This patch fix it to allow unit tests to run with the correct setting

Test Plan: Run db_test and make sure this feature is not on any more.

Reviewers: igor, haobo

Reviewed By: igor

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18483
2014-05-05 11:05:54 -07:00
sdong
4a7c747064 Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB""
And make the default 0 for hash linked list memtable

This reverts commit d69dc64be7.
2014-05-04 13:56:29 -07:00
Igor Canadi
d69dc64be7 Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"
This reverts commit 7dafa3a1d7.
2014-05-04 08:37:09 -07:00
Igor Canadi
d28ed6931f fix release build 2014-05-01 12:42:06 -07:00
Igor Canadi
d29e48bb2e fix compile warning 2014-05-01 14:12:35 -04:00
Igor Canadi
0afc8bc29a xxHash
Summary:
Originally: https://github.com/facebook/rocksdb/pull/87/files

I'm taking over to apply some finishing touches

Test Plan: will add tests

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18315
2014-05-01 14:09:32 -04:00
sdong
7dafa3a1d7 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357
2014-04-30 11:02:26 -07:00
Igor Canadi
c489499a2b Fix OSX compile 2014-04-26 17:15:43 -04:00
Lei Jin
ccaca59bee avoid calling FindFile twice in TwoLevelIterator for PlainTable
Summary:
this is to reclaim the regression introduced in
https://reviews.facebook.net/D17853

Test Plan: make all check

Reviewers: igor, haobo, sdong, dhruba, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17985
2014-04-25 12:23:07 -07:00
Lei Jin
d642c60bdc Check PrefixMayMatch on Seek()
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17853
2014-04-25 12:22:23 -07:00
Lei Jin
3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
sdong
86a0133d05 PlainTableReader to expose index size to users
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.

Test Plan:
Add a unit test.
make all check`

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18195
2014-04-22 19:29:05 -07:00
Yueh-Hsuan Chiang
af6ad113a8 Fix SIGFAULT when running sst_dump on v2.6 db
Summary: Fix the sigfault when running sst_dump on v2.6 db.

Test Plan:
    git checkout bba6595b1f
    make clean
    make db_bench
    ./db_bench --db=/tmp/some/db --benchmarks=fillseq
    arc patch D18039
    make clean
    make sst_dump
    ./sst_dump --file=/tmp/some/db --command=check

Reviewers: igor, haobo, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18039
2014-04-21 17:49:47 -07:00
Igor Canadi
8dc34364d2 Rename "benchmark" back to "bench".
Also, make `benchharness.cc` not compiled into rocksdb library.
2014-04-21 13:12:15 -07:00
Pratyush Seth
ff1b5df4c6 Added benchmark functionality on the lines of folly/Benchmark.h
Summary: Added benchmark functionality on the lines of folly/Benchmark.h

Test Plan: Added unit tests

Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17973
2014-04-21 12:29:55 -07:00
sdong
27d3bc184e Use a different approach to make sure BlockBasedTableReader can use hash index on older files
Summary:
A recent commit e37dd216f9 makes sure hash index can be used when reading existing files. This patch uses another way to achieve the approach:
(1) Currently, always writing kBinarySearch to files, despite of BlockBasedTableOptions.IndexType setting.
(2) When reading a file, read out the field, and make sure it is kBinarySearch, while always use index type by users.

The reason for doing it is, to reserve kHashSearch property on disk to future. If now we write out binary index for both of kHashSearch and kBinarySearch. We have to use a new flag in the future for hash index on disk, otherwise compatibility would break. Also, we want the real index type and type shown in properties block to be consistent.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, ljin, yhchiang, xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18009
2014-04-18 14:09:21 -07:00
Kai Liu
e37dd216f9 Index type doesn't have to be persisted
Summary:

With the recent changes, there is no need to check the property block about the index block type.
If user want to use it, they don't really need any disk format change; everything happens in the fly.

Also another team encountered an error while reading the index type from properties.

Test Plan:

ran all the tests

Reviewers: sdong

CC:

Task ID: #

Blame Rev:
2014-04-17 11:08:12 -07:00
sdong
5cef458a2c RocksDB 2.8 to be able to read files generated by 2.6
Summary:
From 2.6 to 2.7, property block name is renamed from rocksdb.stats to rocksdb.properties. Older properties were not able to be loaded. In 2.8, we seem to have added some logic that uses property block without checking null pointers, which create segment faults.

In this patch, we fix it by:
(1) try rocksdb.stats if rocksdb.properties is not found
(2) add some null checking before consuming rep->table_properties

Test Plan: make sure a file generated in 2.7 couldn't be opened now can be opened.

Reviewers: haobo, igor, yhchiang

Reviewed By: igor

CC: ljin, xjin, dhruba, kailiu, leveldb

Differential Revision: https://reviews.facebook.net/D17961
2014-04-17 09:51:43 -07:00
Igor Canadi
588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Kai Liu
e23e73e67c Use shorten index key for hash-index
Summary:
I was wrong about the "index builder", right now since we create index
by scanning both whole table and index, there is not need to preserve
the whole key as the index key.

I switch back to original way index which is both space efficient and
able to supprot in-fly construction of hash index.

IN this patch, I made minimal change since I'm not sure if we still need
the "pluggable index builder", under current circumstance it is of no use
and kind of over-engineered. But I'm not sure if we can still exploit its
usefulness in the future; otherwise I think I can just burn them with great
vengeance.

Test Plan: unit tests

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17745
2014-04-10 22:45:25 -07:00
Kai Liu
75b59d5146 Enable hash index for block-based table
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.

Test Plan: Wrote several new unit tests.

Reviewers: sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16539
2014-04-10 14:19:43 -07:00
Igor Canadi
4daea66343 Turn on -Wmissing-prototypes
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17649
2014-04-09 21:17:14 -07:00
sdong
e9ed28f9c8 PlainTableBuilder::Add() to use local char array instead of reused std::string as tmp buffer
Summary: Our profile shows that in one of the applications, 5% of the CPU costs of PlainTableBuilder::Add() are spent on std::string stacks. By this simple change, we avoid this global reusable string. Also, we avoid another call of file appending, which probably gives another 2%.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17601
2014-04-09 10:17:32 -07:00
Igor Canadi
34455deb06 Fix Mac OS compile issues 2014-04-08 14:05:53 -07:00
Igor Canadi
5b345b76cb Remove env_ from MergingIterator
Summary: env_ is not used. Compiling for iOS complains.

Test Plan: compiles now

Reviewers: ljin, haobo, sdong, dhruba

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17589
2014-04-08 13:40:42 -07:00
Lei Jin
92c1eb0291 macros for perf_context
Summary: This will allow us to disable them completely for iOS or for better performance

Test Plan: will run make all check

Reviewers: igor, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17511
2014-04-08 10:58:07 -07:00
sdong
5e2db3b434 PlainTableIterator not to store copied key in std::string
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.

This patch improves iterator performance significantly. Running this benchmark:

./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond

The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.

Test Plan:
Add a unit test.
make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17547
2014-04-07 19:06:09 -07:00
Igor Canadi
3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
Igor Canadi
bcd1f15b60 Remove -Wno-unused-const-variable 2014-04-04 16:15:47 -07:00
Igor Canadi
8555ce2dec Merge branch 'master' into columnfamilies 2014-04-02 10:48:05 -07:00
sdong
d50619a559 PlainTableIterator::Seek() shouldn't check bloom filter in total order mode
Summary:
In total order mode, iterator's seek() shouldn't check total order.

Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.

This bug was reported by @igor.

Test Plan: test plain_table_db_test

Reviewers: ljin, haobo, igor

Reviewed By: igor

CC: yhchiang, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17391
2014-04-01 15:05:16 -07:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Lei Jin
0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Igor Canadi
e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
Kai Liu
69f6cf431d Fix two bugs in talbe format
Summary:
Previous code had two bugs:

* didn't initialize the table_magic_number_ explicitly -- as a
  result a random junk number is stored for table_magic_number_, making
  HasInitializedMagicNumber() always return true.
* if condition is inconrrect in set_table_magic_number(), and the return value is not checked.
  I replace if-else by a stronger requirement enforced by assert().

Test Plan:
Previous sst_dump failed to work.
After the fix, things back to normal.

Reviewers: yhchiang

CC: haobo, sdong, igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17055
2014-03-19 16:18:33 -07:00
Igor Canadi
fb2346fc1f [CF] Code cleanup part 1
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.

Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.

This diff also fixes race condition when dropping column family.

Test Plan: Ran db_stress with variety of parameters

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D16833
2014-03-12 09:56:53 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Lei Jin
8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Igor Canadi
1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Kai Liu
566f18e6ad More precise calculation of sub_index_size
Summary:
Previous we did rough estimation of subindex size, which in worst case may result in array reallocation.
This patch aims to get the exact size and avoid any reallocation.

Test Plan: make all check

Reviewers: sdong, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16125
2014-03-06 17:30:46 -08:00
Igor Canadi
0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
kailiu
a01bda0997 Fix a buggy assert
Summary: The assert was pointless since if if prefix is the same as the whole key, assertion will surely fail. Reason behind is when performing the internal key comparison, if user keys are the same, *key with smaller transaction id* wins.

Test Plan: make -j32 && make check

Reviewers: sdong, dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16551
2014-03-04 18:08:23 -08:00
Igor Canadi
fa34697237 Merge branch 'master' into columnfamilies 2014-03-04 09:39:14 -08:00
kailiu
a1d56e73ec Uncomment the unit tests in table test 2014-03-03 22:19:49 -08:00
kailiu
906f3dca72 Add a hash-index component for block
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245
2014-03-03 21:11:49 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
sdong
f0ee2356af Fix issue with iterator operations in this order: Prev(), Seek(), Prev()
Summary:
Due to a bad merge of D14163 and D14001 before checking in D14001, "direction_ = kForward;" in MergeIterator::Seek() was deleted my mistake (in commit b135d01e7b ). It will generate wrong results or assert failure after the sequence of Prev() (or SeekToLast()), Seek() and Prev().

Fix it

Test Plan: make all check

Reviewers: igor, haobo, dhruba

Reviewed By: igor

CC: yhchiang, i.am.jin.lei, ljin, leveldb

Differential Revision: https://reviews.facebook.net/D16527
2014-03-03 17:50:27 -08:00
kailiu
1aeafeccac Make the Create() function comform the convention
Summary:

Moved "Return multiple values" a more conventional way.
2014-03-01 23:43:03 -08:00
Kai Liu
16d4e45c12 Fix the memory leak in table index
Summary:

BinarySearchIndex didn't use unique_ptr to guard the block object nor
delete it in destructor, leading to valgrind failure for "definite
memory leak".

Test Plan:
re-ran the failed valgrind test cases
2014-03-01 11:50:35 -08:00
Kai Liu
ff151132b3 Fix the unit test failure in devbox
Summary:
My last diff was developed in MacOS but in devserver environment error occurs.

I dug into the problem and found the way we calcuate approximate data size is pretty out-of-date. We can use table properties to get more accurate results.

Test Plan: ran ./table_test and passed

Reviewers: igor, dhruba, haobo, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16509
2014-02-28 20:40:05 -08:00
kailiu
74939a9e13 Make the block-based table's index pluggable
Summary:
This patch introduced a new table options that allows us to make
block-based table's index pluggable.

To support that new features:

* Code has been refacotred to be more flexible and supports this option well.
* More documentation is added for the existing obsecure functionalities.
* Big surgeon on DataBlockReader(), where the logic was really convoluted.
* Other small code cleanups.

The pluggablility will mostly affect development of internal modules
and won't change frequently, as a result I intentionally avoid
heavy-weight patterns (like factory) and try to make it simple.

Test Plan: make all check

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16395
2014-02-28 18:19:07 -08:00
kailiu
bf86af5174 Remove the terrible hack in for flush_block_policy_factory
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503
2014-02-28 16:39:27 -08:00
kailiu
444cafc28c Fix inconsistent code format
Summary:
Found some function follows camel style. When naming funciton, we have two styles:

Trivially expose internal data in readonly mode: `all_lower_case()`
Regular function: `CapitalizeFirstLetter()`

I renames these functions.

Test Plan: make -j32

Reviewers: haobo, sdong, dhruba, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16383
2014-02-26 18:56:39 -08:00
sdong
a04dbf6e49 PlainTable::Next() should pass the error message from ReadKey()
Summary:
PlainTable::Next() should pass the error message from ReadKey(). Now it would return a wrong error message.
Also improve the messages of status when failing to read

Test Plan: make all check

Reviewers: ljin, kailiu, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16365
2014-02-26 15:12:44 -08:00
Igor Canadi
d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00
Kai Liu
2b205b35d8 Disable putting filter block to block cache
Summary: This bug caused server crash issues because the filter block is too big and kept purging out of cache.

Test Plan: Wrote a new unit tests to make sure it works.

Reviewers: dhruba, haobo, igor, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16221
2014-02-19 15:38:57 -08:00
Igor Canadi
76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
kailiu
63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Kai Liu
b2e7ee8b41 Followup code refactor on plain table
Summary:
Fixed most comments in https://reviews.facebook.net/D15429.
Still have some remaining comments left.

Test Plan: make all check

Reviewers: sdong, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15885
2014-02-13 15:27:59 -08:00
Kai Liu
59cffe02c4 Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.

Test Plan:
sample output:

    ./table_reader_bench --plain_table --time_unit=nanosecond
    =======================================================================================================
    InMemoryTableSimpleBenchmark:           PlainTable   num_key1:   4096   num_key2:   512   non_empty
    =======================================================================================================
    Histogram (unit: nanosecond):
    Count: 6291456  Average: 475.3867  StdDev: 556.05
    Min: 135.0000  Median: 400.1817  Max: 33370.0000
    Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
    ------------------------------------------------------
    [     120,     140 )        2   0.000%   0.000%
    [     140,     160 )      452   0.007%   0.007%
    [     160,     180 )    13683   0.217%   0.225%
    [     180,     200 )    54353   0.864%   1.089%
    [     200,     250 )   101004   1.605%   2.694%
    [     250,     300 )   729791  11.600%  14.294% ##
    [     300,     350 )   616070   9.792%  24.086% ##
    [     350,     400 )  1628021  25.877%  49.963% #####
    [     400,     450 )   647220  10.287%  60.250% ##
    [     450,     500 )   577206   9.174%  69.424% ##
    [     500,     600 )  1168585  18.574%  87.999% ####
    [     600,     700 )   506875   8.057%  96.055% ##
    [     700,     800 )   147878   2.350%  98.406%
    [     800,     900 )    42633   0.678%  99.083%
    [     900,    1000 )    16304   0.259%  99.342%
    [    1000,    1200 )     7811   0.124%  99.466%
    [    1200,    1400 )     1453   0.023%  99.490%
    [    1400,    1600 )      307   0.005%  99.494%
    [    1600,    1800 )       81   0.001%  99.496%
    [    1800,    2000 )       18   0.000%  99.496%
    [    2000,    2500 )        8   0.000%  99.496%
    [    2500,    3000 )        6   0.000%  99.496%
    [    3500,    4000 )        3   0.000%  99.496%
    [    4000,    4500 )      116   0.002%  99.498%
    [    4500,    5000 )     1144   0.018%  99.516%
    [    5000,    6000 )     1087   0.017%  99.534%
    [    6000,    7000 )     2403   0.038%  99.572%
    [    7000,    8000 )     9840   0.156%  99.728%
    [    8000,    9000 )    12820   0.204%  99.932%
    [    9000,   10000 )     3881   0.062%  99.994%
    [   10000,   12000 )      135   0.002%  99.996%
    [   12000,   14000 )      159   0.003%  99.998%
    [   14000,   16000 )       58   0.001%  99.999%
    [   16000,   18000 )       30   0.000% 100.000%
    [   18000,   20000 )       14   0.000% 100.000%
    [   20000,   25000 )        2   0.000% 100.000%
    [   25000,   30000 )        2   0.000% 100.000%
    [   30000,   35000 )        1   0.000% 100.000%

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:57:36 -08:00
sdong
b5140a0361 Fix table_reader_bench and add it to "make"
Summary: Fix table_reader_bench after some interface changes. Add it to make to avoid future breaking

Test Plan: make table_reader_bench and run it with different options.

Reviewers: kailiu, haobo

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D16107
2014-02-12 18:31:02 -08:00
Siying Dong
f3ae3d07cc Add more black-box tests for PlainTable and explicitly support total order mode
Summary:
1. Add some more implementation-aware tests for PlainTable
2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0  means binary search only. Also fixes some divide 0 risks.
3. Explicitly support total order (only use binary search)
4. some code cleaning up.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16023
2014-02-12 17:37:22 -08:00
Igor Canadi
ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi
b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
kailiu
e6b3e3b4db Support prefix seek in UserCollectedProperties
Summary: We'll need the prefix seek support for property aggregation.

Test Plan: make all check

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15963
2014-02-12 13:14:59 -08:00
kailiu
aa734ce9ab Fix a member variables initialization order issue
Summary:
In MacOS, I got issue with `Footer`'s default constructor, which initialized the magic number with some random number instead of 0.
With investigation, I found we forgot to make the kInvalidTableMagicNumber to be static. As a result, kInvalidTableMagicNumber was assgined to `table_magic_number_` before it is initialized (which will be populated with random number).

Test Plan: passed current unit tests; also passed the unit tests for the incoming diff which used the default footer.

Reviewers: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16077
2014-02-11 14:16:46 -08:00
kailiu
745c181e20 Quick fix for table_test failure
Summary:
* Fixed the compression state array size bug.
* Temporarily disable running `DoCompressionTest()` against bzip, which will fail the test.

Test Plan: make && ./table_test

Reviewers: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16065
2014-02-10 17:05:14 -08:00
Igor Canadi
8e634d3ea4 Merge pull request #74 from alberts/lz4
Support for LZ4 compression.
2014-02-10 15:46:56 -08:00
Albert Strasheim
df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
Kai Liu
9a270f3f6d Fix the valgrind error in table test. 2014-02-07 22:43:58 -08:00
Kai Liu
b8ea5e36b3 Fix incompatible compilation in Linux server 2014-02-07 19:47:48 -08:00
kailiu
161ab42a8a Make table properties shareable
Summary:
We are going to expose properties of all tables to end users through "some" db interface.
However, current design doesn't naturally fit for this need, which is because:

1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version).
2. Copy table properties is OK, but it's slow.

Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce.

Test Plan: `make check` passed

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15999
2014-02-07 19:26:49 -08:00