Commit Graph

1085 Commits

Author SHA1 Message Date
anand76
e10570331d Support row cache with batched MultiGet (#5706)
Summary:
This PR adds support for row cache in ```rocksdb::TableCache::MultiGet```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706

Test Plan:
1. Unit tests in db_basic_test
2. db_bench results with batch size of 2 (```Get``` is faster than ```MultiGet``` for single key) -
Get -
readrandom   :       3.935 micros/op 254116 ops/sec;   28.1 MB/s (22870998 of 22870999 found)
MultiGet -
multireadrandom :       3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found)

Command used -
TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072

Differential Revision: D17086297

Pulled By: anand1976

fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
2019-08-28 16:11:56 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
Levi Tamasi
df8c307d63 Revert to storing UncompressionDicts in the cache (#5645)
Summary:
PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point
of the digest is to create it once and reuse it over and over again. This patch goes back to
storing the uncompression dictionary itself in the cache (which should be now safe to do,
since it no longer includes a Statistics pointer), while preserving the rest of the refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645

Test Plan: make asan_check

Differential Revision: D16551864

Pulled By: ltamasi

fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
2019-08-23 08:27:30 -07:00
Patrick Pei
202942b20c Fix local includes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5722

Differential Revision: D16908380

fbshipit-source-id: 6a0e3cb2730b08d6012d3d7f31c937f01c399846
2019-08-22 16:21:47 -07:00
Maysam Yabandeh
244e6f2002 Refactor MultiGet names in BlockBasedTable (#5726)
Summary:
To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726

Differential Revision: D16962535

Pulled By: maysamyabandeh

fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
2019-08-22 08:49:00 -07:00
anand76
9046bdc5d3 Fix MultiGet() bug when whole_key_filtering is disabled (#5665)
Summary:
The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader.

Tests:
Unit test (confirmed failed before the fix)
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665

Differential Revision: D16902380

Pulled By: anand1976

fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
2019-08-21 10:23:23 -07:00
sdong
e1c468d16f Do readahead in VerifyChecksum() (#5713)
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713

Test Plan: Add a new unit test.

Differential Revision: D16860874

fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
2019-08-16 16:42:56 -07:00
Zhongyi Xie
e89b1c9c6e add missing check for hash index when calling BlockBasedTableIterator (#5712)
Summary:
Previous PR https://github.com/facebook/rocksdb/pull/3601 added support for making prefix_extractor dynamically mutable. However, there was a missing check for hash index when creating new BlockBasedTableIterator. While the check may be redundant because no other types of IndexReader makes uses of the flag, it is less error-prone to add the missing check so that future index reader implementation will not worry about violating the contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5712

Differential Revision: D16842052

Pulled By: miasantreble

fbshipit-source-id: aef11c0ff7a690ed248f5b8fe23481cac486b381
2019-08-16 16:39:49 -07:00
Eli Pozniansky
c2404d9928 Optimizing ApproximateSize to create index iterator just once (#5693)
Summary:
VersionSet::ApproximateSize doesn't need to create two separate index iterators and do binary search for each in BlockBasedTable. So BlockBasedTable::ApproximateSize was added that creates the iterator once and uses it to calculate the data size between start and end keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5693

Differential Revision: D16774056

Pulled By: elipoz

fbshipit-source-id: 53ce262e1a057788243bf30cd9b8aa6581df1a18
2019-08-16 14:18:28 -07:00
Levi Tamasi
d92a59b6f2 Fix regression affecting partitioned indexes/filters when cache_index_and_filter_blocks is false (#5705)
Summary:
PR https://github.com/facebook/rocksdb/issues/5298 (and subsequent related patches) unintentionally changed the
semantics of cache_index_and_filter_blocks: historically, this option
only affected the main index/filter block; with the changes, it affects
index/filter partitions as well. This can cause performance issues when
cache_index_and_filter_blocks is false since in this case, partitions are
neither cached nor preloaded (i.e. they are loaded on demand upon each
access). The patch reverts to the earlier behavior, that is, partitions
are cached similarly to data blocks regardless of the value of the above
option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5705

Test Plan:
make check
./db_bench -benchmarks=fillrandom --statistics --stats_interval_seconds=1 --duration=30 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false
./db_bench -benchmarks=readrandom --use_existing_db --statistics --stats_interval_seconds=1 --duration=10 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false --cache_size=8000000000

Relevant statistics from the readrandom benchmark with the old code:

rocksdb.block.cache.index.miss COUNT : 0
rocksdb.block.cache.index.hit COUNT : 0
rocksdb.block.cache.index.add COUNT : 0
rocksdb.block.cache.index.bytes.insert COUNT : 0
rocksdb.block.cache.index.bytes.evict COUNT : 0
rocksdb.block.cache.filter.miss COUNT : 0
rocksdb.block.cache.filter.hit COUNT : 0
rocksdb.block.cache.filter.add COUNT : 0
rocksdb.block.cache.filter.bytes.insert COUNT : 0
rocksdb.block.cache.filter.bytes.evict COUNT : 0

With the new code:

rocksdb.block.cache.index.miss COUNT : 2500
rocksdb.block.cache.index.hit COUNT : 42696
rocksdb.block.cache.index.add COUNT : 2500
rocksdb.block.cache.index.bytes.insert COUNT : 4050048
rocksdb.block.cache.index.bytes.evict COUNT : 0
rocksdb.block.cache.filter.miss COUNT : 2500
rocksdb.block.cache.filter.hit COUNT : 4550493
rocksdb.block.cache.filter.add COUNT : 2500
rocksdb.block.cache.filter.bytes.insert COUNT : 10331040
rocksdb.block.cache.filter.bytes.evict COUNT : 0

Differential Revision: D16817382

Pulled By: ltamasi

fbshipit-source-id: 28a516b0da1f041a03313e0b70b28cf5cf205d00
2019-08-14 18:16:06 -07:00
Yi Zhang
04a849b7b4 Fix compiler error by deleting GetContext default ctor (#5685)
Summary:
When updating compiler version for MyRocks I'm seeing this error with rocksdb:

```
ome/yzha/mysql/mysql-fork2/rocksdb/table/get_context.h:91:3: error: explicitly defaulted default constructor is implicitly deleted
      [-Werror,-Wdefaulted-function-deleted]
  GetContext() = default;
  ^
/home/yzha/mysql/mysql-fork2/rocksdb/table/get_context.h:166:18: note: default constructor of 'GetContext' is implicitly deleted because field
      'tracing_get_id_' of const-qualified type 'const uint64_t' (aka 'const unsigned long') would not be initialized
  const uint64_t tracing_get_id_;
                 ^
```

The error itself is rather self explanatory and makes sense.

Given that no one seems to be using the default ctor (they shouldn't, anyway), I'm deleting it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5685

Differential Revision: D16747712

Pulled By: yizhang82

fbshipit-source-id: 95c0acb958a1ed41154c0047d2e6fce7644de53f
2019-08-12 16:42:10 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Levi Tamasi
092f417037 Move the uncompression dictionary object out of the block cache (#5584)
Summary:
RocksDB has historically stored uncompression dictionary objects in the block
cache as opposed to storing just the block contents. This neccesitated
evicting the object upon table close. With the new code, only the raw blocks
are stored in the cache, eliminating the need for eviction.

In addition, the patch makes the following improvements:

1) Compression dictionary blocks are now prefetched/pinned similarly to
index/filter blocks.
2) A copy operation got eliminated when the uncompression dictionary is
retrieved.
3) Errors related to retrieving the uncompression dictionary are propagated as
opposed to silently ignored.

Note: the patch temporarily breaks the compression dictionary evicition stats.
They will be fixed in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584

Test Plan: make asan_check

Differential Revision: D16344151

Pulled By: ltamasi

fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
2019-07-23 16:01:44 -07:00
Eli Pozniansky
6b7fcc0d5f Improve CPU Efficiency of ApproximateSize (part 1) (#5613)
Summary:
1. Avoid creating the iterator in order to call BlockBasedTable::ApproximateOffsetOf(). Instead, directly call into it.
2. Optimize BlockBasedTable::ApproximateOffsetOf() keeps the index block iterator in stack.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5613

Differential Revision: D16442660

Pulled By: elipoz

fbshipit-source-id: 9320be3e918c139b10e758cbbb684706d172e516
2019-07-23 15:34:33 -07:00
sdong
66b5613d0c row_cache to share entry for recent snapshots (#5600)
Summary:
Right now, users cannot take advantage of row cache, unless no snapshot is used, or Get() is repeated for the same snapshots. This limits the usage of row cache.
This change eliminate this restriction in some cases. If the snapshot used is newer than the largest sequence number in the file, and write callback function is not registered, the same row cache key is used as no snapshot is given. We still need the callback function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600

Test Plan: Add a unit test.

Differential Revision: D16386616

fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53
2019-07-22 18:56:19 -07:00
haoyuhuang
8a008d4170 Block access tracing: Trace referenced key for Get on non-data blocks. (#5548)
Summary:
This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548

Test Plan: make clean && USE_CLANG=1 make check -j32

Differential Revision: D16157979

Pulled By: HaoyuHuang

fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87
2019-07-17 13:05:58 -07:00
Levi Tamasi
3bde41b5a3 Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.

Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504

Test Plan: make asan_check

Differential Revision: D16036974

Pulled By: ltamasi

fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:14:58 -07:00
haoyuhuang
6ca3feed5c Fix -Werror=shadow (#5546)
Summary:
This PR fixes shadow errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5546

Test Plan: make clean && make check -j32 && make clean && USE_CLANG=1 make check -j32 && make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16147841

Pulled By: HaoyuHuang

fbshipit-source-id: 1043500d70c134185f537ab4c3900452752f1534
2019-07-08 00:12:43 -07:00
sdong
2de61d9129 Assert get_context not null in BlockBasedTable::Get() (#5542)
Summary:
clang analyze fails after https://github.com/facebook/rocksdb/pull/5514 for this failure:
table/block_based/block_based_table_reader.cc:3450:16: warning: Called C++ object pointer is null
          if (!get_context->SaveValue(
               ^~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.

The reaon is that a branching is added earlier in the function on get_context is null or not, CLANG analyze thinks that it can be null and we make the function call withou the null checking.
Fix the issue by removing the branch and add an assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5542

Test Plan: "make all check" passes and CLANG analyze failure goes away.

Differential Revision: D16133988

fbshipit-source-id: d4627d03c4746254cc11926c523931086ccebcda
2019-07-05 12:34:13 -07:00
haoyuhuang
6edc5d0719 Block cache tracing: Associate a unique id with Get and MultiGet (#5514)
Summary:
This PR associates a unique id with Get and MultiGet. This enables us to track how many blocks a Get/MultiGet request accesses. We can also measure the impact of row cache vs block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5514

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16032681

Pulled By: HaoyuHuang

fbshipit-source-id: 775b05f4440badd58de6667e3ec9f4fc87a0af4c
2019-07-03 19:35:41 -07:00
Yi Wu
662ce62044 Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468)
Summary:
This is a second attempt for https://github.com/facebook/rocksdb/issues/5111, with the fix to redo iterate bounds check after `SeekXXX()`. This is because MyRocks may change iterate bounds between seek.

See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion.

Closes https://github.com/facebook/rocksdb/issues/5463.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468

Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`.

Differential Revision: D15863332

fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18
2019-07-02 11:48:46 -07:00
anand76
7259e28d91 MultiGet parallel IO (#5464)
Summary:
Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464

Test Plan:
1. make check
2. make asan_check
3. make asan_crash

Differential Revision: D15911771

Pulled By: anand1976

fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
2019-06-30 20:56:04 -07:00
sdong
15fd3be07b LRU Cache to enable mid-point insertion by default (#5508)
Summary:
Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508

Test Plan: Run all existing tests.

Differential Revision: D16021179

fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
2019-06-27 10:20:57 -07:00
haoyuhuang
a8975b6245 Block cache tracer: Do not populate block cache trace record when tracing is disabled. (#5510)
Summary:
This PR makes sure that trace record is not populated when tracing is disabled.

Before this PR:
DB path: [/data/mysql/rocks_regression_tests/OPTIONS-myrocks-40-33-10000000/2019-06-26-13-04-41/db]
readwhilewriting :       9.803 micros/op 1550408 ops/sec;  107.9 MB/s (5000000 of 5000000 found)
Microseconds per read:
Count: 80000000 Average: 9.8045  StdDev: 12.64
Min: 1  Median: 7.5246  Max: 25343
Percentiles: P50: 7.52 P75: 12.10 P99: 37.44 P99.9: 75.07 P99.99: 133.60

After this PR:
DB path: [/data/mysql/rocks_regression_tests/OPTIONS-myrocks-40-33-10000000/2019-06-26-14-08-21/db]
readwhilewriting :       8.723 micros/op 1662882 ops/sec;  115.8 MB/s (5000000 of 5000000 found)
Microseconds per read:
Count: 80000000 Average: 8.7236  StdDev: 12.19
Min: 1  Median: 6.7262  Max: 25229
Percentiles: P50: 6.73 P75: 10.50 P99: 31.54 P99.9: 74.81 P99.99: 132.82
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5510

Differential Revision: D16016428

Pulled By: HaoyuHuang

fbshipit-source-id: 3b3d11e6accf207d18ec2545b802aa01ee65901f
2019-06-27 08:34:08 -07:00
Mike Kolupaev
9dbcda9e3b Fix uninitialized prev_block_offset_ in BlockBasedTableReader (#5507)
Summary:
Found by valgrind_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5507

Differential Revision: D16002612

Pulled By: miasantreble

fbshipit-source-id: 13c11c183190e0a0571844635457d434da3ac59a
2019-06-25 23:02:01 -07:00
Mike Kolupaev
b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
haoyuhuang
705b8eecb4 Add more callers for table reader. (#5454)
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.

This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15819451

Pulled By: HaoyuHuang

fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
2019-06-20 14:31:48 -07:00
Zhongyi Xie
24f73436fb sanitize and limit block_size under 4GB (#5492)
Summary:
`Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t but  `BlockBasedTableOptions::block_size` is defined as a size_t so user might see corruption as in https://github.com/facebook/rocksdb/issues/5486.
This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations.
yiwu-arbug
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492

Differential Revision: D15914047

Pulled By: miasantreble

fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201
2019-06-20 11:45:08 -07:00
Vijay Nadimpalli
24b118ad98 Combine the read-ahead logic for user reads and compaction reads (#5431)
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change  `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431

Test Plan:
make check

Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5

Differential Revision: D15772533

Pulled By: vjnadimpalli

fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
2019-06-19 14:10:46 -07:00
Levi Tamasi
5355e527d9 Make the 'block read count' performance counters consistent (#5484)
Summary:
The patch brings the semantics of per-block-type read performance
context counters in sync with the generic block_read_count by only
incrementing the counter if the block was actually read from the file.
It also fixes index_block_read_count, which fell victim to the
refactoring in PR https://github.com/facebook/rocksdb/issues/5298.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484

Test Plan: Extended the unit tests.

Differential Revision: D15887431

Pulled By: ltamasi

fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6
2019-06-18 19:03:24 -07:00
Levi Tamasi
d0c6aea192 Revert to respecting only the read_tier read option for index blocks (#5481)
Summary:
PR https://github.com/facebook/rocksdb/issues/5298 subtly changed how read options are applied to the index block
during a Get, MultiGet, or iteration. Earlier, only the read_tier option
applied to the index block read; since PR https://github.com/facebook/rocksdb/issues/5298, fill_cache and
verify_checksums also have an effect. This patch restores the earlier
behavior to prevent surprise memory increases for clients due to the
index block not being cached.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5481

Test Plan: make check

Differential Revision: D15883082

Pulled By: ltamasi

fbshipit-source-id: 9a065ec3a6db5a365cf6dd5e95190a20c5756356
2019-06-18 15:02:09 -07:00
haoyuhuang
7a8d7358bb Integrate block cache tracer in block based table reader. (#5441)
Summary:
This PR integrates the block cache tracer into block based table reader. The tracer will write the block cache accesses using the trace_writer. The tracer is null in this PR so that nothing will be logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5441

Differential Revision: D15772029

Pulled By: HaoyuHuang

fbshipit-source-id: a64adb92642cd23222e0ba8b10d86bf522b42f9b
2019-06-14 17:40:31 -07:00
haoyuhuang
bb4178066d Integrate block cache tracer into db_impl (#5433)
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433

Differential Revision: D15728016

Pulled By: HaoyuHuang

fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
2019-06-13 15:43:10 -07:00
Levi Tamasi
ba64a4cf52 Revert "Reduce iterator key comparison for upper/lower bound check (#5111)" (#5440)
Summary:
This reverts commit f3a7847598.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440

Differential Revision: D15765967

Pulled By: ltamasi

fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0
2019-06-11 16:23:41 -07:00
haoyuhuang
5efa0d6b0d Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421)
Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB:    version 6.2
Date:       Mon Jun 10 10:42:51 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    1000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    114.4 MB (estimated)
FileSize:   114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421

Differential Revision: D15704258

Pulled By: HaoyuHuang

fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
2019-06-10 15:33:27 -07:00
anand76
63ace8ef0e Reuse data block iterator in BlockBasedTableReader::MultiGet() (#5314)
Summary:
Instead of creating a new DataBlockIterator for every key in a MultiGet batch, reuse it if the next key is in the same block. This results in a small 1-2% cpu improvement.

TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4

Without the change -
multireadrandom :       3.066 micros/op 326122 ops/sec; (29375968 of 29375968 found)

With the change -
multireadrandom :       3.003 micros/op 332945 ops/sec; (29983968 of 29983968 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5314

Differential Revision: D15742108

Pulled By: anand1976

fbshipit-source-id: 220fb0b8eea9a0d602ddeb371528f7af7936d771
2019-06-10 13:31:19 -07:00
Levi Tamasi
0f48e56f96 Revert to checking the upper bound on a per-key basis in BlockBasedTableIterator (#5428)
Summary:
PR #5111 reduced the number of key comparisons when iterating with
upper/lower bounds; however, this caused a regression for MyRocks.
Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5428

Differential Revision: D15721038

Pulled By: ltamasi

fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
2019-06-07 15:17:05 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Levi Tamasi
bee2f48a66 Refactor the handling of cache related counters and statistics (#5408)
Summary:
The patch cleans up the handling of cache hit/miss/insertion related
performance counters, get context counters, and statistics by
eliminating some code duplication and factoring out the affected logic
into separate methods. In addition, it makes the semantics of cache hit
metrics more consistent by changing the code so that accessing a
partition of partitioned indexes/filters through a pinned reference no
longer counts as a cache hit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408

Differential Revision: D15610883

Pulled By: ltamasi

fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
2019-06-06 11:36:40 -07:00
Yanqin Jin
340ed4fac7 Add support for timestamp in Get/Put (#5079)
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.

Results are as follows:
```
|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079

Differential Revision: D15132946

Pulled By: riversand963

fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
2019-06-05 23:10:47 -07:00
haoyuhuang
349db90497 Make GetEntryFromCache a member function. (#5394)
Summary:
The commit makes GetEntryFromCache become a member function. It also makes all its callers become member functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5394

Differential Revision: D15579222

Pulled By: HaoyuHuang

fbshipit-source-id: 07509c42ee9022dcded54950012bd3bd562aa1ae
2019-06-03 12:34:59 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Levi Tamasi
a3609b7dde Improve const correctness in BlockBasedTableReader (#5383)
Summary:
Many methods are passing around pointers to non-const objects when in fact
they do not/should not modify said objects. The patch makes the semantics
clearer and also helps from a thread safety point-of-view by changing some
pointers to pointers-to-const and marking some instance methods as const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5383

Differential Revision: D15562770

Pulled By: ltamasi

fbshipit-source-id: 89361dadbb8b25bbe54d17e8da28fee24a2419af
2019-05-31 11:41:35 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Levi Tamasi
1e35584251 Move the index readers out of the block cache (#5298)
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).

Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298

Differential Revision: D15303203

Pulled By: ltamasi

fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
2019-05-30 11:53:27 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
anand76
eb7647ee6c Add comments t get_context.h
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5353

Differential Revision: D15497912

Pulled By: anand1976

fbshipit-source-id: 72cff2465ca342aa810f925be5a7016b938aa416
2019-05-24 13:31:05 -07:00
Vijay Nadimpalli
f66026c8c7 Comments for BlockBasedTable
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5352

Differential Revision: D15498477

Pulled By: vjnadimpalli

fbshipit-source-id: 08a981521848433362a56ac521c7fb83c7dd7b2a
2019-05-24 12:35:25 -07:00
Levi Tamasi
98094f6cac Add some comments for BlockContents
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5354

Differential Revision: D15496645

Pulled By: ltamasi

fbshipit-source-id: 1282b1ce11fbc412d3d87b2688fd0586e7bb6b85
2019-05-24 11:20:09 -07:00
Siying Dong
38a06aa225 Improve comments of classes for PlainTable (#5339)
Summary:
Simply add some comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5339

Differential Revision: D15485315

Pulled By: siying

fbshipit-source-id: 4594b1c4c967e6bd08aa7fa08a37df3481df1938
2019-05-23 16:51:44 -07:00
Levi Tamasi
40aa520a51 Add class comment for BlockFetcher
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5337

Differential Revision: D15482289

Pulled By: ltamasi

fbshipit-source-id: 8639ca78c1b8dfcc337a742d4d81d5752f12545f
2019-05-23 14:22:26 -07:00
yiwu-arbug
f3a7847598 Reduce iterator key comparison for upper/lower bound check (#5111)
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111

Differential Revision: D15330061

Pulled By: siying

fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
2019-05-17 10:28:31 -07:00
Maysam Yabandeh
f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
Mike Kolupaev
6a6aef25c1 Fix crash in BlockBasedTableIterator::Seek() (#5291)
Summary:
https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true. E.g. if there was an IO error or Status::Incomplete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291

Differential Revision: D15273324

Pulled By: al13n321

fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886
2019-05-10 12:40:57 -07:00
Levi Tamasi
f0bf3bf34b Turn CachableEntry into a proper resource handle (#5252)
Summary:
CachableEntry is used in a variety of contexts: it may refer to a cached
object (i.e. an object in the block cache), an owned object, or an
unowned object; also, in some cases (most notably with iterators), the
responsibility of managing the pointed-to object gets handed off to
another object. Each of the above scenarios have different implications
for the lifecycle of the referenced object. For the most part, the patch
does not change the lifecycle of managed objects; however, it makes
these relationships explicit, and it also enables us to eliminate some
hacks and accident-prone code around releasing cache handles and
deleting/cleaning up objects. (The only places where the patch changes
how an objects are managed are the partitions of partitioned indexes and
filters.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252

Differential Revision: D15101358

Pulled By: ltamasi

fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be
2019-05-10 11:57:49 -07:00
Siying Dong
9fad3e21eb Merging iterator to avoid child iterator reseek for some cases (#5286)
Summary:
When reseek happens in merging iterator, reseeking a child iterator can be avoided if:
(1) the iterator represents imutable data
(2) reseek() to a larger key than the current key
(3) the current key of the child iterator is larger than the seek key
because it is guaranteed that the result will fall into the same position.

This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286

Differential Revision: D15283635

Pulled By: siying

fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
2019-05-09 14:20:04 -07:00
Zhongyi Xie
eea1cad850 avoid updating index type during iterator creation (#5288)
Summary:
Right now there is a potential race condition where two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85).  The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads failed to find index_reader from block cache, they will call `CreateIndexReader->UpdateIndexType()` which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realize the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288

Differential Revision: D15252509

Pulled By: miasantreble

fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e
2019-05-07 20:20:40 -07:00
Maysam Yabandeh
6a40ee5eb1 Refresh snapshot list during long compactions (2nd attempt) (#5278)
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278

Differential Revision: D15203291

Pulled By: maysamyabandeh

fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
2019-05-03 17:30:22 -07:00
Zhongyi Xie
5d27d65bef multiget: fix memory issues due to vector auto resizing (#5279)
Summary:
This PR fixes three memory issues found by ASAN
* in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidates references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance.
* Similar issue in construction of GetContext autovector in version_set.cc
* In multiget_context.h use T[] specialization for unique_ptr that holds a char array
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279

Differential Revision: D15202893

Pulled By: miasantreble

fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
2019-05-03 15:58:43 -07:00
Siying Dong
4479dff208 Reduce binary search when reseek into the same data block (#5256)
Summary:
Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following:
1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file
2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256

Differential Revision: D15105072

Pulled By: siying

fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
2019-05-01 14:26:30 -07:00
Maysam Yabandeh
521d234bda Revert snap_refresh_nanos feature (#5269)
Summary:
Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269

Differential Revision: D15151285

Pulled By: maysamyabandeh

fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
2019-05-01 10:07:30 -07:00
Andrew Kryczka
b02d0c238d Init compression dict handle before reading meta-blocks (#5267)
Summary:
At least one of the meta-block loading functions (`ReadRangeDelBlock`)
uses the same block reading function (`NewDataBlockIterator`) as data
block reads, which means it uses the dictionary handle. However, the
dictionary handle was uninitialized while reading meta-blocks, causing
readers to receive an error. This situation was only noticed when
`cache_index_and_filter_blocks=true`.

This PR initializes the handle to null while reading meta-blocks to
prevent the error. It also adds support to `db_stress` /
`db_crashtest.py` for `cache_index_and_filter_blocks`.

Fixes #5263.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267

Differential Revision: D15149264

Pulled By: maysamyabandeh

fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b
2019-04-30 09:50:49 -07:00
Sagar Vemuri
3548e4220d Improve explicit user readahead performance (#5246)
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.

1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.

**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737

For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.

**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.

TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246

Differential Revision: D15107946

Pulled By: sagar0

fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
2019-04-26 21:24:10 -07:00
Maysam Yabandeh
506e8448be Refresh snapshot list during long compactions (#5099)
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099

Differential Revision: D15086710

Pulled By: maysamyabandeh

fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
2019-04-25 18:17:22 -07:00
Yuchi Chen
78a6e07c83 Fix compilation errors for 32bits/LITE/ios build. (#5220)
Summary:
When I build RocksDB for 32bits/LITE/iOS environment, some errors like the following.

`
table/block_based_table_reader.cc:971:44: error: implicit conversion loses integer precision: 'uint64_t'
      (aka 'unsigned long long') to 'size_t' (aka 'unsigned long') [-Werror,-Wshorten-64-to-32]
    size_t block_size = props_block_handle.size();
           ~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~~^~~~~~

./util/file_reader_writer.h:177:8: error: private field 'env_' is not used [-Werror,-Wunused-private-field]
  Env* env_;
       ^
`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5220

Differential Revision: D15023481

Pulled By: siying

fbshipit-source-id: 1b5d121d3016f2b0a8a9a2cc1bd638479357f9f7
2019-04-22 16:02:16 -07:00
Mike Kolupaev
df38c1ce66 Add BlockBasedTableOptions::index_shortening (#5174)
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174

Differential Revision: D14884185

Pulled By: al13n321

fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
2019-04-22 08:20:35 -07:00
anand76
5265c5709e Remove a couple of non-public includes from public header file (#5219)
Summary:
Cleanup a couple of stray includes left by #5011.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5219

Differential Revision: D15007244

Pulled By: anand1976

fbshipit-source-id: 15ca1d4f977b5b60e99df3bfb8fc3db217d19bdd
2019-04-19 11:10:33 -07:00
Sagar Vemuri
efa948741c Use creation_time or mtime when file_creation_time=0 (#5184)
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).

To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.

Here how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184

Differential Revision: D14907795

Pulled By: sagar0

fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
2019-04-18 22:39:34 -07:00
Siying Dong
992dfc7811 Introduce InternalIteratorBase::NextAndGetResult() (#5197)
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197

Differential Revision: D14945977

Pulled By: siying

fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
2019-04-18 11:12:39 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Yanqin Jin
d9280ff2d2 Add back NewEmptyIterator (#5203)
Summary:
#4905 removed the implementation of `NewEmptyIterator` but kept its
declaration in the public header. This breaks some systems that depend on
RocksDB if the systems use `NewEmptyIterator`. Therefore, add it back to fix. cc maysamyabandeh please remind me if I miss anything here. Thanks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5203

Differential Revision: D14968382

Pulled By: riversand963

fbshipit-source-id: 5fb86e99c8cfaf9f7a9473cdb1355d7558ff6e01
2019-04-16 20:28:05 -07:00
yiwu-arbug
f1239d5f10 Avoid per-key upper bound check in BlockBasedTableIterator (#5142)
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.

This patch come with two fixes:

Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.

Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142

Differential Revision: D14907113

Pulled By: siying

fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
2019-04-16 11:37:47 -07:00
Yanqin Jin
3189398c00 Fix bugs detected by clang analyzer (#5185)
Summary:
as titled. False positive included, fixed anyway to make the check
pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185

Differential Revision: D14909384

Pulled By: riversand963

fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
2019-04-12 10:45:56 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Levi Tamasi
59ef2ba559 Evict the uncompression dictionary from the block cache upon table close (#5150)
Summary:
The uncompression dictionary object has a Statistics pointer that might
dangle if the database closed. This patch evicts the dictionary from the
block cache when a table is closed, similarly to how index and filter
readers are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150

Differential Revision: D14782422

Pulled By: ltamasi

fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622
2019-04-04 16:21:12 -07:00
Adam Simpkins
c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
Zhongyi Xie
26015f3b48 add compression options to table properties (#5081)
Summary:
Since we are planning to use dictionary compression and to use different compression level, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query from information_schema.rocksdb_sst_props to see if all sst files are converted to ZSTD dictionary compressions. Resolves https://github.com/facebook/rocksdb/issues/4992

With this PR, user can query table properties through `GetPropertiesOfAllTables` API and get compression options as std::string:
`window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;`
or table_properties->ToString() will also contain it
`# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081

Differential Revision: D14716692

Pulled By: miasantreble

fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd
2019-04-02 14:52:34 -07:00
Siying Dong
ebcc8ae1d3 Revert "Avoid per-key upper bound check in BlockBasedTableIterator (#5101)" (#5132)
Summary:
This reverts commit f29dc1b906.

In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101.
Temporarily revert the diff to keep the branch clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132

Differential Revision: D14718584

Pulled By: siying

fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad
2019-04-02 10:00:38 -07:00
Remington Brasga
127a850beb Fix arena allocation size in NewEmptyInternalIterator (#4905)
Summary:
NewEmptyInternalIterator with arena mistakenly used EmptyIterator to allocate the size from area but then initialized it to a totally different object: EmptyInternalIterator. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4905

Differential Revision: D14689840

Pulled By: maysamyabandeh

fbshipit-source-id: af64fd8ee93d5a4ad54691c792e5ecc5efabc887
2019-03-29 15:09:35 -07:00
Yi Wu
f29dc1b906 Avoid per-key upper bound check in BlockBasedTableIterator (#5101)
Summary:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101

Differential Revision: D14678707

Pulled By: siying

fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e
2019-03-29 13:11:46 -07:00
Siying Dong
89ab1381f8 Apply automatic formatting to some files (#5114)
Summary:
Following files were run through automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h

I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114

Differential Revision: D14633040

Pulled By: siying

fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
2019-03-27 16:24:45 -07:00
Yi Wu
d69241586e Fix perf_context.user_key_comparison_count for range scan (#5098)
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.

Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098

Differential Revision: D14603753

Pulled By: siying

fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
2019-03-27 10:34:27 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Yi Wu
75133b1b6b Fix SstFileReader not able to open ingested file (#5097)
Summary:
Since `SstFileReader` don't know largest seqno of a file, it will fail this check when it open a file with global seqno: ca89ac2ba9/table/block_based_table_reader.cc (L730)
Changes:
* Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check.
* `BlockBasedTable::VerifyChecksum` also double check if checksum will match when excluding global seqno (this is to make the new test in sst_table_reader_test pass).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097

Differential Revision: D14607434

Pulled By: riversand963

fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023
2019-03-26 10:25:18 -07:00
Yi Wu
7ca9eb7542 Fix BlockBasedTableIterator construction missing index_key_is_full parameter
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5104

Differential Revision: D14619000

Pulled By: maysamyabandeh

fbshipit-source-id: c2895794a3f31b826c149dcb698c1952dacc2332
2019-03-26 10:13:01 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
Yi Wu
62eb2c23aa Print data block index options to info log (#5039)
Summary:
Print data block index type related options to info log
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5039

Differential Revision: D14387718

Pulled By: miasantreble

fbshipit-source-id: 9df8f82eea83a8344c7d12a712486f656691bc4a
2019-03-08 11:18:45 -08:00
Siying Dong
0920bf4e68 Revert "Remove PlainTable's feature store_index_in_file (#4914)" (#5034)
Summary:
This reverts commit ee1818081f.

We are not ready to deprecate this feature. revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5034

Differential Revision: D14287246

Pulled By: siying

fbshipit-source-id: e4beafdeaee1c94364fdaa6ba198218d158339f7
2019-03-01 15:45:45 -08:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Andrew Kryczka
c8c8104d7e Dictionary compression for files written by SstFileWriter (#4978)
Summary:
If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.

I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978

Differential Revision: D14060763

Pulled By: ajkr

fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
2019-02-14 11:23:55 -08:00
Andrew Kryczka
62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
Peter (Stig) Edwards
79496d71ed Increment NUMBER_BLOCK_NOT_COMPRESSED when !GoodCompressionRatio (#4929)
Summary:
See https://github.com/facebook/rocksdb/issues/4884
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4929

Differential Revision: D14028333

Pulled By: sagar0

fbshipit-source-id: eed12bceae85385a34aaa6dd303bf0f53c4c7b06
2019-02-11 17:56:23 -08:00
Yanqin Jin
2d049ab7e8 Checksum properties block for block-based table (#4956)
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956

Differential Revision: D14012741

Pulled By: riversand963

fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
2019-02-11 11:50:01 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
158da7a6ee Verify checksum before ingestion (#4916)
Summary:
before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916

Differential Revision: D13863501

Pulled By: riversand963

fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
2019-01-29 17:17:29 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Siying Dong
f184bee77b PlainTable should avoid copying Get() results from immortal source. (#4924)
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable
(read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924

Differential Revision: D13827749

Pulled By: siying

fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
2019-01-25 17:12:19 -08:00
Siying Dong
fc53839bfa Disallow customized hash function in DynamicBloom (#4915)
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915

Differential Revision: D13794452

Pulled By: siying

fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
2019-01-24 10:34:30 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Andrew Kryczka
27054d837b Call NewDataBlockIterator with correct arguments in DB::Get (#4913)
Summary:
The pointer `get_context` was passed as the value for the boolean argument `index_key_is_full`. Luckily the pointer was always non-null so evaluated to true which is the correct value for the boolean argument. But we were missing out on batch updates to stats since we were not passing anything for the `GetContext*` argument and it defaults to `nullptr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4913

Differential Revision: D13791449

Pulled By: ajkr

fbshipit-source-id: dbe40bf406c64d34cb5298604145d18b9e0ca9be
2019-01-23 15:39:05 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Andrew Kryczka
ace543a815 fix accounting for range tombstones in TableProperties (#4841)
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841

Differential Revision: D13568424

Pulled By: ajkr

fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
2019-01-02 15:08:53 -08:00
Alexander Zinoviev
80bf8975fd Add a new per level counter for block cache hit (#4796)
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796

Differential Revision: D13513688

Pulled By: zinoale

fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
2018-12-21 13:20:05 -08:00
Andrew Kryczka
e0be1bc4f1 fix DeleteRange memory leak for mmap and block cache (#4810)
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.

Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810

Differential Revision: D13534296

Pulled By: ajkr

fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
2018-12-20 21:59:49 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Abhishek Madan
cad248f5c6 Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740)
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.

In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740

Differential Revision: D13287382

Pulled By: abhimadan

fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
2018-12-11 12:10:48 -08:00
Anand Ananthabhotla
1b01d23be2 Add PerfContext counters for index/filter block cache stats (#4540)
Summary:
Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540

Differential Revision: D10459652

Pulled By: anand1976

fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425
2018-12-07 15:07:56 -08:00
Sagar Vemuri
0463f61837 Refactor BlockBasedTable::Open (#4636)
Summary:
Refactored and simplified `BlockBasedTable::Open` to be similar to `BlockBasedTableBuilder::Finish` as both these functions complement each other. Also added `BlockBasedTableBuilder::WriteFooter` along the way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4636

Differential Revision: D12933319

Pulled By: sagar0

fbshipit-source-id: 1ff1d02f6d80a63b5ba720a1fc75e71c7344137b
2018-12-07 13:18:44 -08:00
anand76
8d9b4d9741 Fix failure of sst_file_reader_test in LITE mode regression test (#4725)
Summary:
Add a dummy main() in sst_file_reader_test for ROCKSDB_LITE to fix link failure in regression
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4725

Differential Revision: D13252885

Pulled By: anand1976

fbshipit-source-id: 0e22b964815e2bf01aff7d03ed4ae59d44fa86f1
2018-11-29 10:51:41 -08:00
Yi Wu
512a5e3ef8 Fix BlockBasedTable not always using memory allocator if available (#4678)
Summary:
Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678

Differential Revision: D13054594

Pulled By: yiwu-arbug

fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2
2018-11-28 18:01:24 -08:00
Abhishek Madan
8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Huachao Huang
5e72bc113a Add SstFileReader to read sst files (#4717)
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.

Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.

TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717

Differential Revision: D13212686

Pulled By: ajkr

fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
2018-11-27 13:02:23 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Abhishek Madan
6bee36a786 Modify FragmentedRangeTombstoneList member layout (#4632)
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:

```
tombstones_:     [a, b) [c, e) [h, k)
                   | \   /  \   /  |
                   |  \ /    \ /   |
                   v   v      v    v
tombstone_seqs_: [ 5 3 10 7 2 8 6  ]
```

This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632

Differential Revision: D13053103

Pulled By: abhimadan

fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
2018-11-14 17:52:17 -08:00
Siying Dong
b82e57d425 Remove two variables from BlockContents class and don't use class Block for compressed block (#4650)
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.

The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.

Some code is slightly refactored so that the cachable information can flow better.

Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650

Differential Revision: D12969070

Pulled By: siying

fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
2018-11-13 17:02:55 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Siying Dong
566fc8b994 Black list some valgrind tests (#4642)
Summary:
valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642

Differential Revision: D12945237

Pulled By: siying

fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d
2018-11-06 14:22:36 -08:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Abhishek Madan
eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Abhishek Madan
7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Zhongyi Xie
21bf7421ca use per-level perf context for bloom filter related counters (#4581)
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581

Differential Revision: D10518010

Pulled By: miasantreble

fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
2018-10-24 12:21:38 -07:00
Neil Mayhew
43dbd4411e Adapt three unit tests with newer compiler/libraries (#4562)
Summary:
This fixes three tests that fail with relatively recent tools and libraries:

The tests are:

* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`

I'm using:

* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1

The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.

However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:

```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```

I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562

Differential Revision: D10510917

Pulled By: maysamyabandeh

fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
2018-10-24 08:17:56 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
Yanqin Jin
729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
jsteemann
517d3b8b77 fix typo in error message, twice (#4457)
Summary:
Fixes a typo in error messages returned by Iterator::GetProperty(...)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457

Differential Revision: D10281965

Pulled By: sagar0

fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22
2018-10-09 17:07:27 -07:00
Dmitry Alimov
e13d8dcbbb Fix typos in comments (#4456)
Summary:
Fix some typos in the comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456

Differential Revision: D10209214

Pulled By: miasantreble

fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb
2018-10-04 20:46:50 -07:00
Zhongyi Xie
ce1fc5af09 fix unused param allocator in compression.h (#4453)
Summary:
this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
Igor Canadi
1cf5deb8fd Introduce CacheAllocator, a custom allocator for cache blocks (#4437)
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.

We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.

Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437

Differential Revision: D10132814

Pulled By: yiwu-arbug

fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
2018-10-02 17:24:58 -07:00
Maysam Yabandeh
65ac72edd9 Fix bug in partition filters with format_version=4 (#4381)
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381

Differential Revision: D9879505

Pulled By: maysamyabandeh

fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
2018-09-17 17:28:15 -07:00
Sagar Vemuri
ac46790374 Fix sync-point comment in Block destructor (#4380)
Summary:
This is a follow up to #4370. The earlier comment is not correct.

Thanks to ajkr for pointing this out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4380

Differential Revision: D9874667

Pulled By: sagar0

fbshipit-source-id: f4e092d86b29c715258210b770643d367e38caae
2018-09-17 11:58:11 -07:00
Sagar Vemuri
3db584059c Remove sync point from Block destructor (#4370)
Summary:
AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const
==1798517==ABORTING
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370

Differential Revision: D9844146

Pulled By: sagar0

fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1
2018-09-15 00:12:57 -07:00
Maysam Yabandeh
9ea9007b50 Reduce IndexBlockIter size (#4358)
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size be eliminating active_comparator_ field from the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4358

Differential Revision: D9781737

Pulled By: maysamyabandeh

fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
2018-09-12 10:03:35 -07:00
Andrew Kryczka
2c14662213 Revert "Digest ZSTD compression dictionary once per SST file (#4251)" (#4347)
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.

Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347

Differential Revision: D9668365

Pulled By: ajkr

fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
2018-09-06 09:58:34 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Andrew Kryczka
6c40806e51 Digest ZSTD compression dictionary once per SST file (#4251)
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.

ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.

There are a couple other changes included in this PR:

- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251

Differential Revision: D9257078

Pulled By: ajkr

fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
2018-08-23 19:28:18 -07:00
Fenggang Wu
da40d45267 DataBlockHashIndex: avoiding expensive iiter->Next when handling hash kNoEntry (#4296)
Summary:
When returning `kNoEntry` from HashIndex lookup, previously we invalidate the
`biter` by set `current_=restarts_`, so that the search can continue to the next
block in case the search result may reside in the next block.

There is one problem: when we are searching for a missing key, if the search
finds a `kNoEntry` and continue the search to the next block, there is also a
non-trivial possibility that the HashIndex return `kNoEntry` too, and the
expensive index iterator `Next()` will happen several times for nothing.

The solution is that if the hash table returns `kNoEntry`, `SeekForGetImpl()` just search the last restart interval for the key. It will stop at the first key that is large than the seek_key, or to the end of the block, and each case will be handled correctly.

Microbenchmark script:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq,readtocache,readmissing \
          --cache_size=20000000000  --use_data_block_hash_index={true|false}
```

`readmissing` performance (lower is better):
```
binary:                      3.6098 micros/op
hash (before applying diff): 4.1048 micros/op
hash (after  applying diff): 3.3502 micros/op
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4296

Differential Revision: D9419159

Pulled By: fgwu

fbshipit-source-id: 21e3eedcccbc47a249aa8eb4bf405c9def0b8a05
2018-08-23 10:12:58 -07:00
Yanqin Jin
bb5dcea98e Add path to WritableFileWriter. (#4039)
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039

Differential Revision: D8670178

Pulled By: riversand963

fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
2018-08-23 10:12:58 -07:00
Fenggang Wu
640cfa7c33 DataBlockHashIndex: fix comment in NumRestarts() (#4286)
Summary:
Improve the description of the backward compatibility check in NumRestarts()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4286

Differential Revision: D9412490

Pulled By: fgwu

fbshipit-source-id: ea7dd5c61d8ff8eacef623b729d4e4fd53cca066
2018-08-21 17:12:45 -07:00
Fenggang Wu
6d37fdb365 DataBlockHashIndex: Remove the division from EstimateSize() (#4293)
Summary:
`BlockBasedTableBuilder::Add()` eventually calls
`DataBlockHashIndexBuilder::EstimateSize()`. The previous implementation
divides the `num_keys` by the `util_ratio_` to get an estimizted `num_buckets`.
Such division is expensive as it happens in every
`BlockBasedTableBuilder::Add()`.

This diff estimates the `num_buckets` by double addition instead of double
division. Specifically, in each `Add()`, we add `bucket_per_key_`
(inverse of `util_ratio_`) to the current `estimiated_num_buckets_`. The cost is
that we are gonna have the `estimated_num_buckets_` stored as one extra field
in the DataBlockHashIndexBuilder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4293

Differential Revision: D9412472

Pulled By: fgwu

fbshipit-source-id: 2925c8509a401e7bd3c1ab1d9e9c7244755c277a
2018-08-20 23:13:50 -07:00
Siying Dong
dc064f302e Suppress two CLANG Analyze warning (#4291)
Summary:
Suppress two CLANG analyze warnings. They don't seem to be real bugs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291

Differential Revision: D9407333

Pulled By: siying

fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8
2018-08-20 16:57:38 -07:00
Andrey Zagrebin
fa4de6e30f #3865 followup for fix performance degression introduced by switching order of operands (#4284)
Summary:
Followup for #4266. There is one more place in **get_context.cc** where **MergeOperator::ShouldMerge** should be called with reversed list of operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4284

Differential Revision: D9380008

Pulled By: sagar0

fbshipit-source-id: 70ec26e607e5b88465e1acbdcd6c6171bd76b9f2
2018-08-17 10:57:25 -07:00
Fenggang Wu
9d646a6311 Add db_bench options of data block hash index (#4281)
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.

`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281

Differential Revision: D9361476

Pulled By: fgwu

fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
2018-08-16 18:42:46 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Siying Dong
f3d91a0b57 Add a unit test to verify iterators release data blocks after using them (#4170)
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170

Differential Revision: D8962513

Pulled By: siying

fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
2018-08-13 17:43:14 -07:00
Maysam Yabandeh
d511f35ea7 Fix wrong partitioned index size recorded in properties block (#4259)
Summary:
After refactoring in https://github.com/facebook/rocksdb/pull/4158 the properties block is written after the index block. This breaks the existing logic in estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available since by the time we write the properties block, the index block is already written.
The patch also fixes an issue in estimating the partition size with format_version=3 which was resulting into partitions smaller than the configured metadata_block_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259

Differential Revision: D9274454

Pulled By: maysamyabandeh

fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272
2018-08-10 15:27:20 -07:00
Maysam Yabandeh
058026a885 Fix unity compile error (#4257)
Summary:
Fix the compile error in "make unity_test" caused by #3983.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4257

Differential Revision: D9271740

Pulled By: maysamyabandeh

fbshipit-source-id: 94e56d1675bf8bdc0e94439467eb4f40dd107517
2018-08-10 10:27:55 -07:00
Maysam Yabandeh
caf0f53a74 Index value delta encoding (#3983)
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
2018-08-09 16:58:40 -07:00
Maysam Yabandeh
eb8885a08a Return correct usable_size for BlockContents (#4246)
Summary:
If jemalloc is disabled or the API is incorrectly referenced (jemalloc api on windows have a prefix je_) memory usage is incorrectly reported for all block sizes. This is because sizeof(char) is always 1. sizeof() is calculated at compile time and *(char*) is char. The patch uses the size of the slice to fix that.
Fixes #4245
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4246

Differential Revision: D9233958

Pulled By: maysamyabandeh

fbshipit-source-id: 9646933b24504e2814c7379f06a31148829c6b4e
2018-08-08 17:43:00 -07:00
Yanqin Jin
54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
Fenggang Wu
a11df583ec Add DataBlockIndexType option in BlockBasedTableOptions (#4150)
Summary:
Added DataBlockIndexType option in BlockBasedTableOptions.
```
enum DataBlockIndexType : char {
    kDataBlockBinarySearch = 0, // traditional block type
    kDataBlockHashIndex = 1, // additional hash index appended to the end.
};
```
The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150

Differential Revision: D8895958

Pulled By: fgwu

fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
2018-07-27 15:42:27 -07:00
Siying Dong
2a81633da2 Fix bug when seeking backward against an out-of-bound iterator (#4187)
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187

Differential Revision: D8996600

Pulled By: siying

fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
2018-07-25 17:14:01 -07:00
Fenggang Wu
8805ec2f49 DataBlockHashIndex: Standalone Implementation with Unit Test (#4139)
Summary:
The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.

`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139

Reviewed By: sagar0

Differential Revision: D8866764

Pulled By: fgwu

fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
2018-07-24 11:43:37 -07:00
Zhongyi Xie
f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
Siying Dong
a5e851e113 Reformatting some recent changes (#4161)
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161

Differential Revision: D8940582

Pulled By: siying

fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
2018-07-20 14:43:38 -07:00
Siying Dong
8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Andrew Kryczka
ab35505e21 Write properties metablock last in block-based tables (#4158)
Summary:
The properties meta-block should come at the end since we always need to
read it when opening a file, unlike index/filter/other meta-blocks, which
are sometimes read depending on the user's configuration. This ordering
will allow us to (in a future PR) do a small readahead on the end of the file
to read properties and meta-index blocks with one I/O.

The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish`
function. It was previously too large with inconsistent error handling, which
made it difficult to change. So I broke it up into one function per meta-block
write, and tried to make error handling consistent within those functions.
Then reordering the metablocks was trivial -- just reorder the calls to these
helper functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158

Differential Revision: D8921705

Pulled By: ajkr

fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc
2018-07-20 09:11:59 -07:00
Andrew Kryczka
b5613227a9 Smaller tail readahead when not reading index/filters (#4159)
Summary:
In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index.

This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159

Differential Revision: D8924002

Pulled By: ajkr

fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356
2018-07-19 16:13:22 -07:00
Maysam Yabandeh
b55da012f6 Refactor IndexBlockIter (#4141)
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141

Differential Revision: D8866437

Pulled By: maysamyabandeh

fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
2018-07-16 17:13:10 -07:00
Siying Dong
8f06b4fa01 Separate some IndexBlockIter logic from BlockIter (#4136)
Summary:
Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136

Reviewed By: maysamyabandeh

Differential Revision: D8859673

Pulled By: siying

fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e
2018-07-16 10:13:18 -07:00
Nathan VanBenschoten
ef7815b803 Support range deletion tombstones in IngestExternalFile SSTs (#3778)
Summary:
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
2018-07-13 22:43:09 -07:00
Maysam Yabandeh
8581a93a6b Per-thread unique test db names (#4135)
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135

Differential Revision: D8846653

Pulled By: maysamyabandeh

fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
2018-07-13 17:27:39 -07:00
Tamir Duberstein
7bee48bdbd Add GCC 8 to Travis (#3433)
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433

Differential Revision: D6837948

Pulled By: sagar0

fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
2018-07-13 10:58:06 -07:00
Maysam Yabandeh
d4ad32d7bd Refactor BlockIter (#4121)
Summary:
BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121

Differential Revision: D8816832

Pulled By: maysamyabandeh

fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
2018-07-12 17:27:31 -07:00
Nikhil Benesch
5cd8240b86 Test range deletions with more configurations (#4021)
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.

Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.

/cc ajkr nvanbenschoten

Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021

Differential Revision: D8811860

Pulled By: ajkr

fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
2018-07-11 15:57:49 -07:00
Yanqin Jin
d4d9fe8e57 Fix a bug caused by not copying the block trailer. (#4096)
Summary:
This was caught by crash test, and the following is a simple way to reproduce it and verify the fix.
One way to trigger this code path is to use the following configuration:
- Compress SST file
- Enable direct IO and prefetch buffer
- Do NOT use compressed block cache
Closes https://github.com/facebook/rocksdb/pull/4096

Differential Revision: D8742009

Pulled By: riversand963

fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2
2018-07-06 13:12:39 -07:00
Yanqin Jin
39218a72a4 Increase the size of LRU cache. (#4090)
Summary:
Increase the size of each shard so that the number of cache hit/miss match
expectation. Otherwise FilterBlockInBlockCache test will fail.
Closes https://github.com/facebook/rocksdb/pull/4090

Differential Revision: D8736158

Pulled By: riversand963

fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91
2018-07-05 11:45:11 -07:00
Maysam Yabandeh
2462763b2e Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077)
Summary:
We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed.
Closes https://github.com/facebook/rocksdb/pull/4077

Differential Revision: D8694472

Pulled By: maysamyabandeh

fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004
2018-06-29 09:28:12 -07:00
Maysam Yabandeh
29ffbb8a50 Charging block cache more accurately (#4073)
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073

Differential Revision: D8686552

Pulled By: maysamyabandeh

fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
2018-06-29 08:57:20 -07:00
Zhongyi Xie
14f409c0f1 PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067)
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067

Differential Revision: D8669002

Pulled By: miasantreble

fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
2018-06-27 20:42:43 -07:00
Zhichao Cao
1f6efabe23 Add bottommost_compression_opts to for bottommost_compression (#3985)
Summary:
…ression

 For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985

Reviewed By: riversand963

Differential Revision: D8385911

Pulled By: zhichao-cao

fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
2018-06-27 17:42:38 -07:00
Maysam Yabandeh
235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Manuel Ung
a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Nikhil Benesch
17339dc2f3 Add table property tracking number of range deletions (#4016)
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
2018-06-26 20:27:35 -07:00
Zhongyi Xie
408205a36b use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899)
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899

Differential Revision: D8157459

Pulled By: miasantreble

fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
2018-06-26 15:57:26 -07:00
Sagar Vemuri
189f0c27aa Make BlockBasedTableIterator compaction-aware (#4048)
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.

In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048

Differential Revision: D8601056

Pulled By: sagar0

fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
2018-06-25 13:19:27 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00
Sagar Vemuri
7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00
Maysam Yabandeh
28a9d8910b Fix the bug with duplicate prefix in partition filters (#4024)
Summary:
https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut.
This bug could result into false negatives, which means that DB could skip an existing key.
Closes https://github.com/facebook/rocksdb/pull/4024

Differential Revision: D8518866

Pulled By: maysamyabandeh

fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
2018-06-19 14:12:46 -07:00
Siying Dong
92ee3350e0 BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004)
Summary:
b555ed30a4 makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound.

For example, if an SST file has a data block containing following keys : {a, z}

The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")

Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again.

To prevent this regression case, we keep the current data block iterator if it is upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004

Differential Revision: D8463192

Pulled By: siying

fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
2018-06-19 09:57:11 -07:00
Zhongyi Xie
80bc35927c Should only decode restart points for uncompressed blocks (#3996)
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996

Differential Revision: D8416186

Pulled By: miasantreble

fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
2018-06-15 19:26:58 -07:00
Siying Dong
d82f1421b4 Fix regression bug of Prev() with upper bound (#3989)
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
  Seek(key);
  if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().

Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
Closes https://github.com/facebook/rocksdb/pull/3989

Differential Revision: D8385422

Pulled By: siying

fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
2018-06-12 16:57:36 -07:00
Andrew Kryczka
9d347332fb Fix argument mismatch in BlockBasedTableBuilder (#3974)
Summary:
The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least.

Blame: 402b7aa0
Closes https://github.com/facebook/rocksdb/pull/3974

Differential Revision: D8344907

Pulled By: ajkr

fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e
2018-06-12 13:57:44 -07:00
Fenggang Wu
3593275357 Remove restart point from the properties_block (#3970)
Summary:
Property block will be read sequentially and cached in a heap located
object, so there's no need for restart points. Thus we set the restart
interval to infinity to save space.
Closes https://github.com/facebook/rocksdb/pull/3970

Differential Revision: D8332586

Pulled By: fgwu

fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134
2018-06-12 12:57:37 -07:00
Fenggang Wu
f4502944c3 Change db path for BlockBasedTableTest.BadOptions (#3965)
Summary:
BadOptions test creates a temporary db path changed to
table_block_based_bad_options_test to avoid collide with that created by
the PrefixAndWholeKeyTest
Closes https://github.com/facebook/rocksdb/pull/3965

Differential Revision: D8316080

Pulled By: fgwu

fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032
2018-06-08 12:57:14 -07:00
Maysam Yabandeh
b73652169e Extend format 3 to partitioned index/filters (#3958)
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958

Differential Revision: D8294615

Pulled By: maysamyabandeh

fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
2018-06-06 16:58:16 -07:00
Zhongyi Xie
f1592a06c2 run make format for PR 3838 (#3954)
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954

Differential Revision: D8272041

Pulled By: miasantreble

fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
2018-06-05 12:58:02 -07:00
Mike Kolupaev
812c7371d3 Fix performance regression in Get() for block-based tables (#3953)
Summary:
This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in 8bf555f487

This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments.
Closes https://github.com/facebook/rocksdb/pull/3953

Differential Revision: D8270664

Pulled By: al13n321

fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac
2018-06-05 11:43:16 -07:00
Maysam Yabandeh
d0c38c0c8c Extend some tests to format_version=3 (#3942)
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942

Differential Revision: D8238413

Pulled By: maysamyabandeh

fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
2018-06-04 20:13:00 -07:00
Dmitri Smirnov
f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00
Andrew Kryczka
fea2b1dfb2 Copy Get() result when file reads use mmap
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.

For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped.   The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).

This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.

This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881

Differential Revision: D8076288

Pulled By: ajkr

fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
2018-06-01 16:57:58 -07:00