Commit Graph

87 Commits

Author SHA1 Message Date
Marton Trencseni
9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
sdong
b1fafcaca6 Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature."
This reverts commit 522de4f59e.

It has bug of index block cleaning up.
2016-03-21 11:50:42 -07:00
Marton Trencseni
522de4f59e Adding pin_l0_filter_and_index_blocks_in_cache feature.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.

Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).

DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
  Mac: OK.
  Linux: with D55287 patched in it's OK.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54801
2016-03-17 22:40:01 +00:00
sdong
b2ae5950ba Index Reader should not be reused after DB restart
Summary:
In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems.
Fix it by making cache key identical per table reader.

Test Plan: Add a new test which failed with out the commit but now pass.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55287
2016-03-14 10:04:09 -07:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang
4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00
Islam AbdelRahman
0c433cd1eb Fix issue in Iterator::Seek when using Block based filter block with prefix_extractor
Summary: Similar to D53385 we need to check InDomain before checking the filter block.

Test Plan: unit tests

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53421
2016-01-26 14:47:42 -08:00
Islam AbdelRahman
b0afcdeeac Fix bug in block based tables with full filter block and prefix_extractor
Summary:
Right now when we are creating a BlockBasedTable with fill filter block
we add to the filter all the prefixes that are InDomain() based on the prefix_extractor

the problem is that when we read a key from the file, we check the filter block for the prefix whether or not it's InDomain()

Test Plan: unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53385
2016-01-26 11:07:08 -08:00
Peter Mattis
260c29762a Fix index seeking in BlockTableReader::PrefixMayMatch.
PrefixMayMatch previously seeked in the prefix index using an internal
key with a sequence number of 0. This would cause the prefix index seek
to fall off the end if the last key in the index had a user-key greater
than or equal to the key being looked for. Falling off the end of the
index in turn results in PrefixMayMatch returning false if the index is
in memory.
2016-01-06 16:39:56 -05:00
Andrew Kryczka
e089db40f9 Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.

- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr

Test Plan:
updated unit test:

  $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits

will also run 'make check'

Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D51633
2015-12-23 10:15:07 -08:00
sdong
35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Islam AbdelRahman
51fa7ecec5 Bytes read/written from cache statistics
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48195
2015-10-07 15:17:20 -07:00
sdong
d746eaad5e RandomAccessFileReader should not inherit RandomAccessFile
Summary: RandomAccessFileReader unnecessarily inherited RandomAccessFile, which can introduce unnecessarily extra costs. Remove it.

Test Plan: Run all existing tests

Reviewers: yhchiang, anthony, igor, kradhakrishnan, rven, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47409
2015-09-23 11:00:41 -07:00
Igor Canadi
76f286cc82 Optimize bloom filter cache misses
Summary:
This optimizes the case when (cache_index_and_filter_blocks=1) and bloom filter is not present in the cache.

Previously we did:
1. Read meta block from file
2. Read the filter position from the meta block
3. Read the filter

Now, we pre-load the filter position on Table::Open(), so we can skip steps (1) and (2) on bloom filter cache miss. Instead of 2 IOs, we do only 1.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46047
2015-09-02 15:36:47 -07:00
sdong
6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
sdong
041b6f95a2 perf_context: report time spent on reading index and bloom blocks
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.

Test Plan: Will write a unit test

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41433
2015-07-10 14:45:42 -07:00
Igor Canadi
0a019d74a0 Use malloc_usable_size() for accounting block cache size
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!

This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.

This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.

I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.

Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40635
2015-06-26 11:48:09 -07:00
clark.kang
6ede020dc4 fix typos 2015-04-25 18:14:27 +09:00
sdong
98a44559d5 Build for CYGWIN
Summary:
Make it build for CYGWIN.
Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions.

Test Plan: Build it and run some unit tests in CYGWIN

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37605
2015-04-23 21:33:44 -07:00
krad
f29b33c73b Add functionality to pre-fetch blocks specified by a key range to BlockBasedTable implementation.
Summary:
Pre-fetching is a common operation performed by data stores for
disk/flash based systems as part of database startup.

This is part of task 5197184.

Test Plan: Run the newly added unit test

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33933
2015-03-02 17:07:03 -08:00
krad
d9f4875e52 Disable pre-fetching of index and filter blocks for sst_dump_tool.
Summary:
BlockBasedTable pre-fetches the filter and index blocks on Open call.
This is an optimistic optimization targeted for runtime scenario. The
optimization is unnecessary for sst_dump_tool

- Added a provision to disable pre-fetching of index and filter blocks
  in BlockBasedTable
- Disabled pre-fetching for the sst_dump tool

Stack for reference :

#01  0x00000000005ed944 in snappy::InternalUncompress<snappy::SnappyArrayWriter> () from /home/engshare/third-party2/snappy/1.0.3/src/snappy-1.0.3/snappy.cc:148
#02  0x00000000005edeee in snappy::RawUncompress () from /home/engshare/third-party2/snappy/1.0.3/src/snappy-1.0.3/snappy.cc:947
#03  0x00000000004e0b4d in rocksdb::UncompressBlockContents () from /data/users/paultuckfield/rocksdb/./util/compression.h:69
#04  0x00000000004e145c in rocksdb::ReadBlockContents () from /data/users/paultuckfield/rocksdb/table/format.cc:334
#05  0x00000000004ca424 in rocksdb::(anonymous namespace)::ReadBlockFromFile () from /data/users/paultuckfield/rocksdb/table/block_based_table_reader.cc:70
#06  0x00000000004cccad in rocksdb::BlockBasedTable::CreateIndexReader () from /data/users/paultuckfield/rocksdb/table/block_based_table_reader.cc:173
#07  0x00000000004d17e5 in rocksdb::BlockBasedTable::Open () from /data/users/paultuckfield/rocksdb/table/block_based_table_reader.cc:553
#08  0x00000000004c8184 in rocksdb::BlockBasedTableFactory::NewTableReader () from /data/users/paultuckfield/rocksdb/table/block_based_table_factory.cc:51
#09  0x0000000000598463 in rocksdb::SstFileReader::NewTableReader () from /data/users/paultuckfield/rocksdb/util/sst_dump_tool.cc:69
#10  0x00000000005986c2 in rocksdb::SstFileReader::SstFileReader () from /data/users/paultuckfield/rocksdb/util/sst_dump_tool.cc:26
#11  0x0000000000599047 in rocksdb::SSTDumpTool::Run () from /data/users/paultuckfield/rocksdb/util/sst_dump_tool.cc:332
#12  0x0000000000409b06 in main () from /data/users/paultuckfield/rocksdb/tools/sst_dump.cc:12

Test Plan:
- Added a unit test to trigger the code.
- Also did some manual verification.
- Passed all unit tests

task #6296048

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34041
2015-02-25 16:34:26 -08:00
Igor Sugak
98870c7b9c rocksdb: Fix scan-build memory warning in table/block_based_table_reader.cc
Summary:
scan-build is reporting two memory leak bugs in `table/block_based_table_reader.cc`. They are both false positives. In both cases we allocate memory in `ReadBlockFromFile` if `s.ok()`. Then after the function `ReadBlockFromFile` returns we check for the same variable if `s.ok()` and then use the memory that was allocated. The bugs reported by scan-build is if `ReadBlockFromFile` allocates memory and returns, but for some reason status `s` is not the same and `s.ok() != true`.

In this case scan-build is concerned that memory owner transfer is not explicit. I modified `ReadBlockFromFile` to accept `std::unique_ptr<Block>*` as a parameter, instead of raw pointer.

scan-build reports:
http://home.fburl.com/~sugak/latest2/report-a4b3fa.html#EndPath
http://home.fburl.com/~sugak/latest2/report-29adbf.html#EndPath

Test Plan:
Make sure scan-build does not report these bugs and all tests are passing.
```lang=bash
% make check
% make analyze
```

Reviewers: sdong, lgalanis, meyering, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33681
2015-02-19 14:07:38 -08:00
sdong
68af7811ea Remember whole key/prefix filtering on/off in SST file
Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.

Test Plan: Add a unit test for it

Reviewers: yhchiang, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D32889
2015-02-11 11:20:04 -08:00
sdong
e63140d52b Get() to use prefix bloom filter when filter is not block based
Summary:
Get() now doesn't make use of bloom filter if it is prefix based. Add the check.
Didn't touch block based bloom filter. I can't fully reason whether it is correct to do that. But it's straight-forward to for full bloom filter.

Test Plan:
make all check
Add a test case in DBTest

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D31941
2015-02-04 15:15:41 -08:00
Igor Canadi
9ab5adfc59 New BlockBasedTable version -- better compressed block format
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.

It does not affect format for snappy.

If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.

Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB

Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31461
2015-01-14 16:24:24 -08:00
Igor Canadi
96b8240bc5 Support footer versions bigger than 1
Summary:
In this diff I add another parameter to BlockBasedTableOptions that will let users specify block based table's format. This will greatly simplify block based table's format changes in the future.

First format change that this will support is encoding decompressed size in Zlib and BZip2 blocks. This diff is blocking https://reviews.facebook.net/D31311.

Test Plan: Added a unit tests. More tests to come as part of https://reviews.facebook.net/D31311.

Reviewers: dhruba, MarkCallaghan, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31383
2015-01-13 14:33:04 -08:00
Manish Patil
7ea7bdf04d Dump routine to BlockBasedTableReader
Summary: Added necessary routines for dumping block based SST with block filter

Test Plan: Added "raw" mode to utility sst_dump

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D29679
2014-12-23 13:24:07 -08:00
Yueh-Hsuan Chiang
13de000f07 Add rocksdb::ToString() to address cases where std::to_string is not available.
Summary:
In some environment such as android, the c++ library does not have
std::to_string.  This path adds rocksdb::ToString(), which wraps std::to_string
when std::to_string is not available, and implements std::to_string
in the other case.

Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29181
2014-11-24 20:44:49 -08:00
Igor Canadi
9f7fc3ac45 Turn on -Wshadow
Summary:
...and fix all the errors :)

Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean.

Test Plan: compiles

Reviewers: yhchiang, rven, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27711
2014-10-31 11:59:54 -07:00
Yueh-Hsuan Chiang
98849a35fa Apply InfoLogLevel to the logs in table/block_based_table_reader.cc
Summary:
Apply InfoLogLevel to the logs in table/block_based_table_reader.cc

Also, add missing checks for the returned status in BlockBasedTable::Open

Test Plan: make

Reviewers: sdong, ljin, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D28005
2014-10-31 11:41:15 -07:00
Lei Jin
0fd985f427 Avoid reloading filter on Get() if cache_index_and_filter_blocks == false
Summary:
This fixes the case that filter policy is missing in SST file, but we
open the table with filter policy on and cache_index_and_filter_blocks =
false. The current behavior is that we will try to load it every time on
Get() but fail.

Test Plan: unit test

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25455
2014-10-22 11:52:35 -07:00
Lei Jin
2faf49d5f1 use GetContext to replace callback function pointer
Summary:
Intead of passing callback function pointer and its arg on Table::Get()
interface, passing GetContext. This makes the interface cleaner and
possible better perf. Also adding a fast pass for SaveValue()

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24057
2014-09-29 11:09:09 -07:00
Lei Jin
0a29ce5393 re-enable BlockBasedTable::SetupForCompaction()
Summary:
It was commented out in D22545 by accident. Keep the option in
ImmutableOptions for now. I can make it dynamic in
https://reviews.facebook.net/D23349

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23865
2014-09-23 14:18:57 -07:00
Igor Canadi
55af370756 Remove TODO for checking index checksums 2014-09-23 13:02:23 -07:00
Igor Canadi
54cada92b1 Run make format on PR #249 2014-09-17 15:08:50 -07:00
Torrie Fischer
fb6456b00d Replace naked calls to operator new and delete (Fixes #222)
This replaces a mishmash of pointers in the Block and BlockContents classes with
std::unique_ptr. It also changes the semantics of BlockContents to be limited to
use as a constructor parameter for Block objects, as it owns any block buffers
handed to it.
2014-09-17 13:50:07 -07:00
Feng Zhu
0af157f9bf Implement full filter for block based table.
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.

2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.

3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.

4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.

5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc

Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.

Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh

Reviewers: igor, yhchiang, ljin, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20979
2014-09-08 10:37:05 -07:00
Lei Jin
5665e5e285 introduce ImmutableOptions
Summary:
As a preparation to support updating some options dynamically, I'd like
to first introduce ImmutableOptions, which is a subset of Options that
cannot be changed during the course of a DB lifetime without restart.

ColumnFamily will keep both Options and ImmutableOptions. Any component
below ColumnFamily should only take ImmutableOptions in their
constructor. Other options should be taken from APIs, which will be
allowed to adjust dynamically.

I am yet to make changes to memtable and other related classes to take
ImmutableOptions in their ctor. That can be done in a seprate diff as
this one is already pretty big.

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D22545
2014-09-04 16:18:36 -07:00
Feng Zhu
1d23b5c470 remove_internal_filter_policy
Summary:
1. remove class InternalFilterPolicy in db/dbformat.h
2. Transformation from internal key to user key is done in filter_block.cc
3. This is a preparation for patch D20979

Test Plan:
make all check
valgrind ./db_test

Reviewers: igor, yhchiang, ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22509
2014-08-28 17:06:29 -07:00
Lei Jin
23861857c4 ReadOptions.total_order_seek to allow total order seek for block-based table when hash index is enabled
Summary: as title

Test Plan: table_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22239
2014-08-25 16:14:30 -07:00
Lei Jin
384400128f move block based table related options BlockBasedTableOptions
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21915
2014-08-25 14:22:05 -07:00
sdong
1242bfcad7 Add DB property "rocksdb.estimate-table-readers-mem"
Summary:
Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache.

Refactor the property codes to allow getting property from a version, with DB mutex not acquired.

Test Plan: Add several checks of this new property in existing codes for various cases.

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D20733
2014-08-06 11:39:46 -07:00
Feng Zhu
8f09d53fd1 remove malloc when create data and index iterator in Get
Summary:
  Define Block::Iter to be an independent class to be used by block_based_table_reader
  When creating data and index iterator, update an existing iterator rather than new one
  Thus malloc and free could be reduced

Benchmark,
Base:
commit 76286ee67e
commands:
--db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=2621440 --use_hash_search=1 --block_size=1024 --block_restart_interval=1 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1

malloc: 3.30% -> 1.42%
free: 3.59%->1.61%

Test Plan:
  make all check
  run db_stress
  valgrind ./db_test ./table_test

Reviewers: ljin, yhchiang, dhruba, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20655
2014-07-30 16:34:35 -07:00
Lei Jin
40fa8a4cd5 make statistics forward-able
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.

Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20145
2014-07-28 12:05:36 -07:00
Igor Canadi
d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Haobo Xu
7a9dd5f214 [RocksDB] Make block based table hash index more adaptive
Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index.

Test Plan: unit test, also manually veried LOG that fallback did occur.

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19191
2014-06-19 16:40:32 -07:00
Haobo Xu
0f0076ed5a [RocksDB] Reduce memory footprint of the blockbased table hash index.
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.

Test Plan: unit test and shadow testing

Reviewers: dhruba, kailiu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19047
2014-06-18 18:16:07 -07:00
sdong
88a1691a1e BlockBasedTable::PrefixMayMatch() to bloom setting to the beginning of the function
Summary: In BlockBasedTable::PrefixMayMatch() we calculate prefix even if bloom is not config. Move the check before

Test Plan: make all check

Reviewers: igor, ljin

Reviewed By: ljin

Subscribers: wuj, leveldb, haobo, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18993
2014-06-10 11:14:22 -07:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00