Commit Graph

110 Commits

Author SHA1 Message Date
kailiu
b1d2de4a40 Fix #26 by putting the implementation of CreateDBStatistics() to a cc file 2013-12-05 22:29:03 -08:00
Igor Canadi
043fc14c3e Get rid of some shared_ptrs
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.

The ones that are left are accessed infrequently and I think we're fine with keeping them.

Test Plan: make asan_check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14427
2013-12-03 11:17:58 -08:00
Dhruba Borthakur
96bc3ec297 Memtables should be deleted appropriately in the unit test.
Summary:
Memtables should be deleted appropriately in the unit test.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-12-01 21:23:44 -08:00
Haobo Xu
5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
kailiu
0c93df912e Improve the readability of the TableProperties::ToString() 2013-11-21 17:54:23 -08:00
Siying Dong
3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong
3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
kailiu
56589ab81f Add TableOptions for BlockBasedTableFactory
We are having more and more options to specify for this table so it makes sense to have a TableOptions for future extension.
2013-11-20 18:42:12 -08:00
kailiu
1c8b819be2 Fix a memory leak happened in table_test 2013-11-20 13:45:32 -08:00
kailiu
6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
Igor Canadi
469a9f32a7 Fix two nasty use-after-free-bugs
Summary:
These bugs were caught by ASAN crash test.
1. The first one, in table/filter_block.cc is very nasty. We first reference entries_ and store the reference to Slice prev. Then, we call entries_.append(), which can change the reference. The Slice prev now points to junk.
2. The second one is a bug in a test, so it's not very serious. Once we set read_opts.prefix, we never clear it, so some other function might still reference it.

Test Plan: asan crash test now runs more than 5 mins. Before, it failed immediately. I will run the full one, but the full one takes quite some time (5 hours)

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14223
2013-11-19 21:01:48 -08:00
kailiu
1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing test. and the sample output of table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
kailiu
97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff invoves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Kai Liu
7604e2f70c Make the options in table_builder/block_builder less misleading
Summary:
By original design, the regular `block options` and index `block options` in table_builder is mutable. We can use ChangeOptions to change the options directly.

However, with my last change, `BlockBuilder` no longer hold the reference to the index_block_options -- as a result, any changes made after the creation of index block builder will be of no effect.

But still the code is very error-prone and developers can easily fall into the trap without aware of it. To avoid this problem from happening in the future, I deleted the `ChangeOptions` and the `index_block_options`, as well as many other changes to make it less misleading.

Test Plan:
make
make check
make release

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13707
2013-11-16 23:21:15 -08:00
Siying Dong
55baa3d955 Add an option to table_reader_bench to access the table from DB And Iterating non-existing prefix case.
Summary: This patch adds an option to table_reader_bench that queries run against DB level (which has one table). It is useful if user wants to see the extra costs DB level introduces.

Test Plan: Run the benchmark with and without the new parameter

Reviewers: haobo, dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13863
2013-11-15 22:23:12 -08:00
Igor Canadi
e0ad0f26b8 Fix bloom filters
Summary: https://reviews.facebook.net/D13167 broke bloom filters. If filter is not in cache, we want to return true (safe thing). Am I right?

Test Plan: when benchmarking https://reviews.facebook.net/D14031 I got different results when using bloom filters vs. when not using them. This fixed the issue. I will also be putting this change to the other diff, but that one will probably be in review for longer time.

Reviewers: kailiu, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14085
2013-11-14 14:05:15 -08:00
Kai Liu
80bb81c6fe Add the correct table_factory for tables in table_tests 2013-11-12 23:54:31 -08:00
Kai Liu
88ba331c1a Add the index/filter block cache
Summary: This diff leverage the existing block cache and extend it to cache index/filter block.

Test Plan:
Added new tests in db_test and table_test

The correctness is checked by:

1. make check
2. make valgrind_check

Performance is test by:

1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance).
2. db_stress.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, haobo, xjin

Differential Revision: https://reviews.facebook.net/D13167
2013-11-12 22:46:51 -08:00
kailiu
21587760b9 Fixing the warning messages captured under mac os # Consider using git commit -m 'One line title' && arc diff. # You will save time by running lint and unit in the background.
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..

Test Plan: ran make in mac os

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14049
2013-11-12 20:05:28 -08:00
Kai Liu
0ef628537c Don't not suggest flushing data when data block is still empty
Summary:

This diff fix the bug when the Options::block_size is too small.
2013-11-11 21:05:16 -08:00
Kai Liu
551ecfa416 Move down the time consuming tests in table_test
Summary:

it helps us to better check the tests we really care.

Test Plan:

make
2013-11-10 01:17:32 -08:00
Kai Liu
fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Dhruba Borthakur
292c2b3357 Fix stress test failure when using mmap-reads.
Summary:
The mmap-read file->Read() does not use the scratch buffer to
read in file-contents.

Test Plan: ./db_stress --test_batches_snapshots=1 --ops_per_thread=100000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=0 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/dhruba --max_key=100000000 --disable_seek_compaction=0 --mmap_read=1 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=1 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: leveldb, kailiu, emayanke

Differential Revision: https://reviews.facebook.net/D13923
2013-11-06 15:40:26 -08:00
Igor Canadi
36409e0016 Fix slow no-io iterator
Summary:
This fixes #3130525. Dhruba's suggestion and Tnovak's implementation :)

The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK?

Test Plan: Run the logdevice test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13911
2013-11-06 14:11:52 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Siying Dong
82b7e37f6e Fix a bug of table_reader_bench
Summary: Iterator benchmark case is timed incorrectly. Fix it

Test Plan: Run the benchmark

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13845
2013-10-31 15:26:06 -07:00
Siying Dong
7caadf2e52 A very simple benchmark to measure Table implemenation's Get() And Iterator performance
Summary: It is a very simple benchmark to measure a Table implementation's Get() and iterator performance if all the data is in memory.

Test Plan: N/A

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13743
2013-10-31 13:38:54 -07:00
Naman Gupta
b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Naman Gupta
fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Kai Liu
79d8dad331 Change a typo in method signature 2013-10-28 21:23:17 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Kai Liu
1ca86f0391 Fix a bug that index block's restart_block_interval is not 1
Summary:

This bug may affect the seek performance.

Test Plan:

make
make check

Also gdb into some index block builder to make sure the restart_block_interval is `1`.
2013-10-28 10:51:34 -07:00
Kai Liu
43ee5e2b3a Fix the valgrind error in newly added unittests for table stats
Summary:

Previous the newly added test called NewBloomFilter without releasing it at the end of the test, which resulted in memory leak and was detected by valgrind.

Test Plan:

Ran valgrind test.
2013-10-20 22:02:05 -07:00
Siying Dong
9edda37027 Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress
Summary:
This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases.

Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later.

Test Plan:
add test
DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13467
2013-10-17 13:33:39 -07:00
Kai Liu
aac44226a0 Add bloom filter to predefined table stats
Summary: As title.

Test Plan: Updated the unit tests to make sure new statistic is correctly written/read.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13497
2013-10-17 11:43:06 -07:00
Dhruba Borthakur
9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Kai Liu
86ef6c3f74 Add statistics to sst file
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file.  It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.

This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.

After this diff, we will figure out the interface of how to allow user to collect their interested statistics.

Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13419
2013-10-14 15:56:13 -07:00
Kai Liu
1f8ade6bd6 Fix a bug in table builder
Summary:
In talbe.cc, when reading the metablock, it uses BytewiseComparator();
However in table_builder.cc, we use r->options.comparator. After tracing
the creation of r->options.comparator, I found this comparator is an
InternalKeyComparator, which wraps the user defined comparator(details
can be found in DBImpl::SanitizeOptions().

I encountered this problem when adding metadata about "bloom filter"
before. With different comparator, we may fail to do the binary sort.

Current code works well since there is only one entry in meta block.

Test Plan:
make all check

I've also tested this change in https://reviews.facebook.net/D8283 before.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13335
2013-10-07 18:53:19 -07:00
Dhruba Borthakur
4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Dhruba Borthakur
a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Haobo Xu
f2f4c8072f [RocksDB] Added nano second stopwatch and new perf counters to track block read cost
Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.

Test Plan: perf_context_test, also testing with viewstate shadow traffic.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D12351
2013-09-07 21:14:54 -07:00
Mayank Agarwal
352f0636ef Fix memory leak in table.cc
Summary:
In InternalGet, BlockReader returns an Iterator which is legitimately freed at the end of the 'else' scope. BUT there is a break statement in between and must be freed there too!
The best solution would be to move to unique_ptr and let it handle. Changed it to a unique_ptr.

Test Plan: valgrind ./db_test;make all check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12681
2013-09-02 22:13:29 -07:00
Dhruba Borthakur
fc0c399d2e Introduced a new flag non_blocking_io in ReadOptions.
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().

Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Maniphest Tasks: T63

Differential Revision: https://reviews.facebook.net/D12531
2013-08-28 10:49:14 -07:00
Mayank Agarwal
dad2731729 Fix bug in KeyMayExist
Summary: In KeyMayExist.db_test we do a Flush which causes sst file to be written and added as open file in TableCache, but block cache for the file is not populated. So value_found should have been false where it was true and KeyMayExist.db_test should not have passed earlier. But it passed because BlockReader in table/table.cc takes 2 default arguments at the end called for_compaction and no_io. Although I passed no_io=true from InternalGet to BlockReader, but it understood for_compaction=true and defaulted no_io to false. This is a bug and although will be removed by Dhruba's new patch to incorporate no_io in readoptions, I'm submitting this patch to fix this bug independently of that patch.

Test Plan: make all check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12537
2013-08-26 08:45:58 -07:00
Tyler Harter
4504c99030 Internal/user key bug fix.
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.

Test Plan: ./filter_block_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12501
2013-08-23 14:49:57 -07:00
Dhruba Borthakur
1186192ed1 Replace include/leveldb with include/rocksdb.
Summary: Replace include/leveldb with include/rocksdb.

Test Plan:
make clean; make check
make clean; make release

Differential Revision: https://reviews.facebook.net/D12489
2013-08-23 10:51:00 -07:00
Jim Paton
74781a0c49 Add three new MemTableRep's
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.

UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.

VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.

PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.

I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).

Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12117
2013-08-22 23:10:02 -07:00
Tyler Harter
94cf218720 Revert "Prefix scan: db_bench and bug fixes"
This reverts commit c2bd8f4824.
2013-08-22 18:01:11 -07:00