rocksdb

Author	SHA1	Message	Date
Siying Dong	2a81633da2	Fix bug when seeking backward against an out-of-bound iterator (#4187 ) Summary: `92ee3350e0` introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187 Differential Revision: D8996600 Pulled By: siying fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1	2018-07-25 17:14:01 -07:00
Fenggang Wu	8805ec2f49	DataBlockHashIndex: Standalone Implementation with Unit Test (#4139 ) Summary: The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested. `DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called. `DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139 Reviewed By: sagar0 Differential Revision: D8866764 Pulled By: fgwu fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10	2018-07-24 11:43:37 -07:00
Zhongyi Xie	f95a5b2464	Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490 ) Summary: Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker` Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490 Differential Revision: D6969154 Pulled By: miasantreble fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029	2018-07-20 16:58:13 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Andrew Kryczka	ab35505e21	Write properties metablock last in block-based tables (#4158 ) Summary: The properties meta-block should come at the end since we always need to read it when opening a file, unlike index/filter/other meta-blocks, which are sometimes read depending on the user's configuration. This ordering will allow us to (in a future PR) do a small readahead on the end of the file to read properties and meta-index blocks with one I/O. The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish` function. It was previously too large with inconsistent error handling, which made it difficult to change. So I broke it up into one function per meta-block write, and tried to make error handling consistent within those functions. Then reordering the metablocks was trivial -- just reorder the calls to these helper functions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158 Differential Revision: D8921705 Pulled By: ajkr fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc	2018-07-20 09:11:59 -07:00
Andrew Kryczka	b5613227a9	Smaller tail readahead when not reading index/filters (#4159 ) Summary: In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index. This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159 Differential Revision: D8924002 Pulled By: ajkr fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356	2018-07-19 16:13:22 -07:00
Maysam Yabandeh	b55da012f6	Refactor IndexBlockIter (#4141 ) Summary: Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141 Differential Revision: D8866437 Pulled By: maysamyabandeh fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b	2018-07-16 17:13:10 -07:00
Siying Dong	8f06b4fa01	Separate some IndexBlockIter logic from BlockIter (#4136 ) Summary: Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136 Reviewed By: maysamyabandeh Differential Revision: D8859673 Pulled By: siying fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e	2018-07-16 10:13:18 -07:00
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	2018-07-13 22:43:09 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Maysam Yabandeh	d4ad32d7bd	Refactor BlockIter (#4121 ) Summary: BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121 Differential Revision: D8816832 Pulled By: maysamyabandeh fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197	2018-07-12 17:27:31 -07:00
Nikhil Benesch	5cd8240b86	Test range deletions with more configurations (#4021 ) Summary: Run the basic range deletion tests against the standard set of configurations. This testing exposed that files with hash indexes and partitioned indexes were not handling the case where the file contained only range deletions--i.e., where the index was empty. Additionally file a TODO about the fact that range deletions are broken when allow_mmap_reads = true is set. /cc ajkr nvanbenschoten Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021 Differential Revision: D8811860 Pulled By: ajkr fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343	2018-07-11 15:57:49 -07:00
Yanqin Jin	d4d9fe8e57	Fix a bug caused by not copying the block trailer. (#4096 ) Summary: This was caught by crash test, and the following is a simple way to reproduce it and verify the fix. One way to trigger this code path is to use the following configuration: - Compress SST file - Enable direct IO and prefetch buffer - Do NOT use compressed block cache Closes https://github.com/facebook/rocksdb/pull/4096 Differential Revision: D8742009 Pulled By: riversand963 fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2	2018-07-06 13:12:39 -07:00
Yanqin Jin	39218a72a4	Increase the size of LRU cache. (#4090 ) Summary: Increase the size of each shard so that the number of cache hit/miss match expectation. Otherwise FilterBlockInBlockCache test will fail. Closes https://github.com/facebook/rocksdb/pull/4090 Differential Revision: D8736158 Pulled By: riversand963 fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91	2018-07-05 11:45:11 -07:00
Maysam Yabandeh	2462763b2e	Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077 ) Summary: We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed. Closes https://github.com/facebook/rocksdb/pull/4077 Differential Revision: D8694472 Pulled By: maysamyabandeh fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004	2018-06-29 09:28:12 -07:00
Maysam Yabandeh	29ffbb8a50	Charging block cache more accurately (#4073 ) Summary: Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it. Closes https://github.com/facebook/rocksdb/pull/4073 Differential Revision: D8686552 Pulled By: maysamyabandeh fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f	2018-06-29 08:57:20 -07:00
Zhongyi Xie	14f409c0f1	PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067 ) Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a	2018-06-27 20:42:43 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	2018-06-26 20:27:35 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Sagar Vemuri	189f0c27aa	Make BlockBasedTableIterator compaction-aware (#4048 ) Summary: Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`. In `7103559f49`, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain. Closes https://github.com/facebook/rocksdb/pull/4048 Differential Revision: D8601056 Pulled By: sagar0 fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3	2018-06-25 13:19:27 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Sagar Vemuri	7103559f49	Improve direct IO range scan performance with readahead (#3884 ) Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see 4.5X performance improvement in seekrandom benchmark doing long range scans, when using direct reads, on flash. Description: This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Implementation Details: - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. Constraints: - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. Benchmarks: I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb	2018-06-21 11:13:08 -07:00
Maysam Yabandeh	28a9d8910b	Fix the bug with duplicate prefix in partition filters (#4024 ) Summary: https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut. This bug could result into false negatives, which means that DB could skip an existing key. Closes https://github.com/facebook/rocksdb/pull/4024 Differential Revision: D8518866 Pulled By: maysamyabandeh fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf	2018-06-19 14:12:46 -07:00
Siying Dong	92ee3350e0	BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004 ) Summary: `b555ed30a4` makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81	2018-06-19 09:57:11 -07:00
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	2018-06-15 19:26:58 -07:00
Siying Dong	d82f1421b4	Fix regression bug of Prev() with upper bound (#3989 ) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4	2018-06-12 16:57:36 -07:00
Andrew Kryczka	9d347332fb	Fix argument mismatch in BlockBasedTableBuilder (#3974 ) Summary: The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext`. We were mistakenly passing the `GetContext` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least. Blame: `402b7aa0` Closes https://github.com/facebook/rocksdb/pull/3974 Differential Revision: D8344907 Pulled By: ajkr fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e	2018-06-12 13:57:44 -07:00
Fenggang Wu	3593275357	Remove restart point from the properties_block (#3970 ) Summary: Property block will be read sequentially and cached in a heap located object, so there's no need for restart points. Thus we set the restart interval to infinity to save space. Closes https://github.com/facebook/rocksdb/pull/3970 Differential Revision: D8332586 Pulled By: fgwu fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134	2018-06-12 12:57:37 -07:00
Fenggang Wu	f4502944c3	Change db path for BlockBasedTableTest.BadOptions (#3965 ) Summary: BadOptions test creates a temporary db path changed to table_block_based_bad_options_test to avoid collide with that created by the PrefixAndWholeKeyTest Closes https://github.com/facebook/rocksdb/pull/3965 Differential Revision: D8316080 Pulled By: fgwu fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032	2018-06-08 12:57:14 -07:00
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	2018-06-06 16:58:16 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Mike Kolupaev	812c7371d3	Fix performance regression in Get() for block-based tables (#3953 ) Summary: This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in `8bf555f487` This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments. Closes https://github.com/facebook/rocksdb/pull/3953 Differential Revision: D8270664 Pulled By: al13n321 fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac	2018-06-05 11:43:16 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
Zhongyi Xie	2a0dfaa044	fix PrefixExtractorChanged: pass raw pointer instead shared_ptr Summary: This should resolve the performance regression caused by the unnecessary copying of the shared_ptr. Closes https://github.com/facebook/rocksdb/pull/3937 Differential Revision: D8232330 Pulled By: miasantreble fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97	2018-05-31 21:42:50 -07:00
Maysam Yabandeh	44cf84932f	Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db	2018-05-31 19:28:00 -07:00
Maysam Yabandeh	03cda531e4	Check for rep_->table_properties being nullptr Summary: The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check. Closes https://github.com/facebook/rocksdb/pull/3918 Differential Revision: D8188638 Pulled By: maysamyabandeh fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b	2018-05-29 12:13:55 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Nathan VanBenschoten	8c3bf0801b	Check status when reading HashIndexPrefixesMetadataBlock Summary: This was missed in a refactor of `ReadBlockContents` (`2f1a3a4`). Closes https://github.com/facebook/rocksdb/pull/3906 Differential Revision: D8172648 Pulled By: ajkr fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090	2018-05-25 17:42:51 -07:00
Yanqin Jin	aa53579d6c	Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438	2018-05-25 11:57:51 -07:00
Zhongyi Xie	6c73a46693	Fix a backward compatibility problem with table_properties being nullptr Summary: Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then. Closes https://github.com/facebook/rocksdb/pull/3890 Differential Revision: D8100914 Pulled By: miasantreble fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a	2018-05-22 13:57:17 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Andrew Kryczka	7b655214d2	Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41	2018-05-21 09:57:00 -07:00
Siying Dong	26da3676d9	class Block to store num_restarts_ Summary: Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class. Closes https://github.com/facebook/rocksdb/pull/3869 Differential Revision: D8052493 Pulled By: siying fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1	2018-05-18 12:56:55 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Siying Dong	ddfd2525d2	Make BlockIter final Summary: Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding. Closes https://github.com/facebook/rocksdb/pull/3828 Differential Revision: D7933816 Pulled By: siying fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c	2018-05-09 10:27:26 -07:00
Maysam Yabandeh	fc522bdb3e	Evenly split HarnessTest.Randomized Summary: Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion. Before: ``` [ OK ] HarnessTest.Randomized1n2 (2278 ms) [ OK ] HarnessTest.Randomized3n4 (1095 ms) [ OK ] HarnessTest.Randomized5 (658 ms) [ OK ] HarnessTest.Randomized6 (1258 ms) [ OK ] HarnessTest.Randomized7 (6476 ms) [ OK ] HarnessTest.Randomized8 (8182 ms) ``` After ``` [ OK ] HarnessTest.Randomized1 (2649 ms) [ OK ] HarnessTest.Randomized2 (2645 ms) [ OK ] HarnessTest.Randomized3 (2577 ms) [ OK ] HarnessTest.Randomized4 (2490 ms) [ OK ] HarnessTest.Randomized5 (2553 ms) [ OK ] HarnessTest.Randomized6 (2560 ms) [ OK ] HarnessTest.Randomized7 (2501 ms) [ OK ] HarnessTest.Randomized8 (2574 ms) ``` Closes https://github.com/facebook/rocksdb/pull/3808 Differential Revision: D7882663 Pulled By: maysamyabandeh fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e	2018-05-04 15:28:06 -07:00
Zhongyi Xie	6cab3184f5	avoid double delete on dummy record insertion failure Summary: When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless. Closes https://github.com/facebook/rocksdb/pull/3688 Differential Revision: D7537741 Pulled By: miasantreble fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6	2018-05-01 16:01:28 -07:00
Andrew Kryczka	7004e45489	Remove block-based table assertion for non-empty filter block Summary: `7a6353bd1c` prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the `7a6353bd1c` patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe. Closes https://github.com/facebook/rocksdb/pull/3773 Differential Revision: D7769964 Pulled By: ajkr fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3	2018-04-26 14:43:11 -07:00
Maysam Yabandeh	7e4e381495	Fix the bloom filter skipping empty prefixes Summary: `bc0da4b512` optimized bloom filters by skipping duplicate entires when the whole key and prefixes are both added to the bloom. It however used empty string as the initial value of the last entry added to the bloom. This is incorrect since empty key/prefix are valid entires by themselves. This patch fixes that. Closes https://github.com/facebook/rocksdb/pull/3776 Differential Revision: D7778803 Pulled By: maysamyabandeh fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc	2018-04-26 13:28:31 -07:00
Maysam Yabandeh	bc0da4b512	Skip duplicate bloom keys when whole_key and prefix are mixed Summary: Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate. Closes https://github.com/facebook/rocksdb/pull/3764 Differential Revision: D7744413 Pulled By: maysamyabandeh fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8	2018-04-24 10:58:16 -07:00
Maysam Yabandeh	17e04039dd	Propagate fill_cache config to partitioned index iterator Summary: Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator. Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache. Closes https://github.com/facebook/rocksdb/pull/3739 Differential Revision: D7678308 Pulled By: maysamyabandeh fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da	2018-04-20 15:13:05 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	d2bcd7611f	Fix the memory leak with pinned partitioned filters Summary: The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a	2018-04-09 16:28:19 -07:00
Maysam Yabandeh	67182678a5	Stats for false positive rate of full filtesr Summary: Adds two stats to allow us measuring the false positive rate of full filters: - The total count of positives: rocksdb.bloom.filter.full.positive - The total count of true positives: rocksdb.bloom.filter.full.true.positive Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches. Closes #3680 Tested by: $ ./db_bench -benchmarks=fillrandom -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10 $ ./db_bench -benchmarks="readwhilewriting" -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log $ grep filter.full /tmp/full.log rocksdb.bloom.filter.full.positive COUNT : 3628593 rocksdb.bloom.filter.full.true.positive COUNT : 3536026 which gives the false positive rate of 2.5% Closes https://github.com/facebook/rocksdb/pull/3681 Differential Revision: D7517570 Pulled By: maysamyabandeh fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052	2018-04-05 15:58:48 -07:00
Sagar Vemuri	d687670256	Fix a leak in FilterBlockBuilder when adding prefix Summary: Our valgrind continuous test found an interesting leak which got introduced in #3614. We were adding the prefix key before saving the previous prefix start offset, due to which previous prefix offset is always incorrect. Fixed it by saving the the previous sate before adding the key. Closes https://github.com/facebook/rocksdb/pull/3660 Differential Revision: D7418698 Pulled By: sagar0 fbshipit-source-id: 9933685f943cf2547ed5c553f490035a2fa785cf	2018-03-27 15:13:56 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Dmitri Smirnov	d382ae7de6	Imporve perf of random read and insert compare by suggesting inlining to the compiler Summary: Results from 2015 compiler. This improve sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining. Before: fillseq : 3.638 micros/op 274866 ops/sec; 213.9 MB/s After: fillseq : 3.379 micros/op 295979 ops/sec; 230.3 MB/s Closes https://github.com/facebook/rocksdb/pull/3645 Differential Revision: D7382711 Pulled By: siying fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4	2018-03-23 13:26:55 -07:00
Huachao Huang	7a6353bd1c	Ignore empty filter block when data block is empty Summary: Close https://github.com/facebook/rocksdb/issues/3592 Closes https://github.com/facebook/rocksdb/pull/3614 Differential Revision: D7291706 Pulled By: ajkr fbshipit-source-id: 9dd8f40bd7716588e1e3fd6be0c2bc2766861f8c	2018-03-21 23:13:05 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Siying Dong	b560fc9f62	Fix a block pinning regression introduced in `b555ed30a4` Summary: `b555ed30a4` introduces a regression, which causes blocks always to be pinned in block based iterators. Fix it. Closes https://github.com/facebook/rocksdb/pull/3582 Differential Revision: D7189534 Pulled By: siying fbshipit-source-id: 117dc7a03d0a0e360424db02efb366e12da2be03	2018-03-08 10:12:23 -08:00
Fosco Marotto	d518fe1da6	uint64_t and size_t changes to compile for iOS Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e	2018-03-06 12:43:51 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplifed an assertion - move non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Siying Dong	b555ed30a4	Customized BlockBasedTableIterator and LevelIterator Summary: Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization. The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators. The two-level-iterator now is only used by partitioned index, so it is simplified. Closes https://github.com/facebook/rocksdb/pull/3406 Differential Revision: D6809041 Pulled By: siying fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa	2018-02-12 17:12:25 -08:00
Andrew Kryczka	e78715c29a	Eliminate a memcpy for uncompressed blocks Summary: `ReadBlockFromFile` uses a stack buffer to hold small data blocks before passing them to the compression library, which outputs uncompressed data in a heap buffer. In the case of `kNoCompression` there is a `memcpy` to copy from stack buffer to heap buffer. This PR optimizes `ReadBlockFromFile` to skip the stack buffer for files whose blocks are known to be uncompressed. We determine this using the SST file property, "compression_name", if it's available. Closes https://github.com/facebook/rocksdb/pull/3472 Differential Revision: D6920848 Pulled By: ajkr fbshipit-source-id: 5c753e804efc178b9229ae5dbe6a4adc32031f07	2018-02-07 15:57:37 -08:00
Zhongyi Xie	2f29991701	split RandomizedHarnessTest more ways Summary: RandomizedHarnessTest enumerates different combinations of test type, compression type, restart interval, etc. For some combinations it takes very long to finish, causing the test to time out in test infrastructure. This PR split the test input into smaller trunks in the hope that they will fit in the timeout window. Another possibility is to reduce `num_entries` of course Closes https://github.com/facebook/rocksdb/pull/3467 Differential Revision: D6910235 Pulled By: miasantreble fbshipit-source-id: 717246ee5d21a8a48ad82d4d9c04f9051a66f07f	2018-02-06 13:58:18 -08:00
Andrew Kryczka	1edac32b77	Update rocksdb.read.block.get.micros when block cache disabled Summary: Previously `ReadBlockFromFile` for data blocks was only measured when reading a block to populate block cache. This PR adds the corresponding measurements for users who disabled block cache. Closes https://github.com/facebook/rocksdb/pull/3442 Differential Revision: D6848671 Pulled By: ajkr fbshipit-source-id: bb4bbe1797fa2cc1d9a5bad44891af2b55384b41	2018-01-31 14:26:52 -08:00
Fosco Marotto	77dc069eb9	Change size_t cast in table_test Summary: Fixes this build error on master (macOS): ``` table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] ``` Closes https://github.com/facebook/rocksdb/pull/3434 Reviewed By: maysamyabandeh Differential Revision: D6840354 Pulled By: gfosco fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8	2018-01-30 11:12:51 -08:00
Zhongyi Xie	3fe0937180	Use block cache to track memory usage when ReadOptions.fill_cache=false Summary: ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache. The memory used by the data block is, however, not trackable by users. To make the system more manageable, we can cost the block to block cache while using it, and then release it after using. Closes https://github.com/facebook/rocksdb/pull/3333 Differential Revision: D6670230 Pulled By: miasantreble fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308	2018-01-29 14:43:10 -08:00
Mark Isaacson	b8eb32f8cf	Suppress lint in old files Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c	2018-01-29 12:56:42 -08:00
Maysam Yabandeh	46acdc9883	Split HarnessTest_Randomized to avoid timeout Summary: Split HarnessTest_Randomized to two tests Closes https://github.com/facebook/rocksdb/pull/3424 Differential Revision: D6826006 Pulled By: maysamyabandeh fbshipit-source-id: 59c9a11c7da092206effce6e4fa3792f9c66bef2	2018-01-29 07:41:44 -08:00
Sagar Vemuri	d938226af4	Improve performance of long range scans with readahead Summary: This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Constraints: - The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR). - This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). Thanks to siying for the original idea and implementation. Benchmarks: Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` Page cache was cleared before each experiment with the command: ``` sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" ``` ``` Before: seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found) With this change: seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found) ``` ~3.9X performance improvement. Also verified with strace and gdb that the readahead size is increasing as expected. ``` strace -e readahead -f -T -t -p <db_bench process pid> ``` Closes https://github.com/facebook/rocksdb/pull/3282 Differential Revision: D6586477 Pulled By: sagar0 fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c	2018-01-25 21:41:53 -08:00
Siying Dong	1039133f2d	BlockBasedTable::NewDataBlockIterator to always return BlockIter Summary: This is a pre-cleaning up before a major block based table iterator refactoring. BlockBasedTable::NewDataBlockIterator() will always return BlockIter. This simplifies the logic and code and enable further refactoring and optimization. Closes https://github.com/facebook/rocksdb/pull/3398 Differential Revision: D6780165 Pulled By: siying fbshipit-source-id: 273f7dc896724f682c0118fb69a359d9cc4418b4	2018-01-25 14:57:18 -08:00
Islam AbdelRahman	c615689bb5	Support skipping bloom filters for SstFileWriter Summary: Add an option for SstFileWriter to skip building bloom filters Closes https://github.com/facebook/rocksdb/pull/3360 Differential Revision: D6709120 Pulled By: IslamAbdelRahman fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809	2018-01-22 14:42:18 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Dmitri Smirnov	b010116d82	Eliminate some redundant block reads. Summary: Re-use metadata for reading Compression Dictionary on BlockBased table open, this saves two reads from disk. This helps to our 999 percentile in 5.6.1 where prefetch buffer is not present. Closes https://github.com/facebook/rocksdb/pull/3354 Differential Revision: D6695753 Pulled By: ajkr fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c	2018-01-10 17:11:58 -08:00
Anand Ananthabhotla	199405192d	Add a BlockBasedTableOption to turn off index block compression. Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes https://github.com/facebook/rocksdb/pull/3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054	2018-01-10 15:11:59 -08:00
Siying Dong	ccc095a016	Speed up BlockTest.BlockReadAmpBitmap Summary: BlockTest.BlockReadAmpBitmap is too slow and times out in some environments. Speed it up by: (1) improve the way the verification is done. With this it is 5 times faster (2) run fewer tests for large blocks. This cut it down by another 10 times. Now it can finish in similar time as other tests. Closes https://github.com/facebook/rocksdb/pull/3313 Differential Revision: D6643711 Pulled By: siying fbshipit-source-id: c2397d666eab5421a78ca87e1e45491e0f832a6d	2018-01-02 10:41:28 -08:00
Siying Dong	6b77c07379	NUMBER_BLOCK_COMPRESSED, etc, shouldn't be treated as timer counter Summary: NUMBER_BLOCK_DECOMPRESSED and NUMBER_BLOCK_COMPRESSED are not reported unless the stats level contain detailed timers, which is wrong. They are normal counters. Fix it. Closes https://github.com/facebook/rocksdb/pull/3263 Differential Revision: D6552519 Pulled By: siying fbshipit-source-id: 40899ccea7b2856bb39752616657c0bfd432f6f9	2017-12-14 10:27:43 -08:00
Zhongyi Xie	51c2ea0feb	Reduce heavy hitter for Get operation Summary: This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get` - rocksdb.block.cache.bytes.write - rocksdb.block.cache.add - rocksdb.block.cache.data.miss - rocksdb.block.cache.data.bytes.insert - rocksdb.block.cache.data.add - rocksdb.block.cache.hit - rocksdb.block.cache.data.hit - rocksdb.block.cache.bytes.read The db_bench statistics before and after the change are: \|1GB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|4.22% \|1.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.51% \|0.21% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.14% \|0.14% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| \|1MB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|3.48% \|1.08% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.80% \|0.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.35% \|0.35% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| Closes https://github.com/facebook/rocksdb/pull/3172 Differential Revision: D6330532 Pulled By: miasantreble fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741	2017-12-12 21:11:33 -08:00
Yi Wu	7393ef779c	Fix BlockFetcher ASAN error Summary: Some call sites of BlockFetcher create temporary ReadOptions and pass to BlockFetcher. The temporary object will be gone after BlockFetcher construction but BlockFetcher keep its reference, causing stack-use-after-scope. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3258 Differential Revision: D6547152 Pulled By: yiwu-arbug fbshipit-source-id: 6b49e9dd46bb72307f5d8f88ea15faacff35b9bc	2017-12-12 12:12:38 -08:00
Siying Dong	a9c8d4ef15	Fix memory issue introduced by `2f1a3a4d74` Summary: Closes https://github.com/facebook/rocksdb/pull/3256 Differential Revision: D6541714 Pulled By: siying fbshipit-source-id: 40efd89b68587a9d58cfe6f4eebd771c2d9f1542	2017-12-11 18:27:28 -08:00
Siying Dong	2f1a3a4d74	Refactor ReadBlockContents() Summary: Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher. I hope in general it makes the code easier to maintain. Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future. Closes https://github.com/facebook/rocksdb/pull/3244 Differential Revision: D6520983 Pulled By: siying fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9	2017-12-11 15:27:32 -08:00
Prashant D	baff91c1ad	table: Fix coverity issues Summary: table/block.cc: 420 } CID 1396127 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member restart_offset_ is not initialized in this constructor nor in any functions that it calls. 421} table/block_based_table_builder.cc: CID 1418259 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member compressed_cache_key_prefix_size is not initialized in this constructor nor in any functions that it calls. table/block_based_table_reader.h: 3. uninit_member: Non-static class member index_type is not initialized in this constructor nor in any functions that it calls. CID 1396147 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 5. uninit_member: Non-static class member hash_index_allow_collision is not initialized in this constructor nor in any functions that it calls. 413 global_seqno(kDisableGlobalSequenceNumber) {} 414 table/cuckoo_table_reader.cc: 55 if (hash_funs == user_props.end()) { 56 status_ = Status::Corruption("Number of hash functions not found"); 5. uninit_member: Non-static class member is_last_level_ is not initialized in this constructor nor in any functions that it calls. 7. uninit_member: Non-static class member identity_as_first_hash_ is not initialized in this constructor nor in any functions that it calls. 9. uninit_member: Non-static class member use_module_hash_ is not initialized in this constructor nor in any functions that it calls. 11. uninit_member: Non-static class member num_hash_func_ is not initialized in this constructor nor in any functions that it calls. 13. uninit_member: Non-static class member key_length_ is not initialized in this constructor nor in any functions that it calls. 15. uninit_member: Non-static class member user_key_length_ is not initialized in this constructor nor in any functions that it calls. 17. uninit_member: Non-static class member value_length_ is not initialized in this constructor nor in any functions that it calls. 19. uninit_member: Non-static class member bucket_length_ is not initialized in this constructor nor in any functions that it calls. 21. uninit_member: Non-static class member cuckoo_block_size_ is not initialized in this constructor nor in any functions that it calls. 23. uninit_member: Non-static class member cuckoo_block_bytes_minus_one_ is not initialized in this constructor nor in any functions that it calls. CID 1322785 (#2 of 2): Uninitialized scalar field (UNINIT_CTOR) 25. uninit_member: Non-static class member table_size_ is not initialized in this constructor nor in any functions that it calls. 57 return; table/plain_table_index.h: 2. uninit_member: Non-static class member index_size_ is not initialized in this constructor nor in any functions that it calls. CID 1322801 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 4. uninit_member: Non-static class member sub_index_size_ is not initialized in this constructor nor in any functions that it calls. 128 huge_page_tlb_size_(huge_page_tlb_size) {} 129 Closes https://github.com/facebook/rocksdb/pull/3113 Differential Revision: D6505719 Pulled By: yiwu-arbug fbshipit-source-id: 38f44d8f9dfefb4c2e25d83b8df25a5201c75618	2017-12-07 11:57:36 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Andrew Kryczka	c85f8ccca3	convert null terminator in ascii dump Summary: The ASCII output is almost always useless to me as the first '\0' byte in the key or value causes it to stop printing. Since all characters are already surrounded by spaces, "\ 0" (how we display a backslash followed by a zero) and "\0" (how this PR displays a null terminator) are distinguishable. My assumption is the value of seeing all the bytes outweighs the value of the alignment we had before, where we always had one character followed by one space. Closes https://github.com/facebook/rocksdb/pull/3203 Differential Revision: D6428651 Pulled By: ajkr fbshipit-source-id: aafc978a51e9ea029cfe3e763e2bb0e1751b9ccf	2017-11-28 17:28:58 -08:00
Phani Shekhar Mantripragada	4b65cfc723	Support for block_cache num_shards and other config via option string. Summary: Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M". It doesn't accept other parameters like num_shards etc. Changes : 1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>". Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well. For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings. 2) Corresponding unit tests are added. Closes https://github.com/facebook/rocksdb/pull/3108 Differential Revision: D6420997 Pulled By: sagar0 fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43	2017-11-28 10:48:53 -08:00
Zhongyi Xie	a6c6b8b38c	Revert "No need for Restart Interval for meta blocks" Summary: See [issue 3169](https://github.com/facebook/rocksdb/issues/3169) for more information This reverts commit `593d3de371`. Closes https://github.com/facebook/rocksdb/pull/3188 Differential Revision: D6379271 Pulled By: miasantreble fbshipit-source-id: 88f9ed67ba52237ad9b6f7251db83672b62d7537	2017-11-20 16:42:03 -08:00
Maysam Yabandeh	30285ee31c	Fix calculating filter partition target size Summary: block_size_deviation is in percentage while the partition size is in bytes. The current code fails to take that into account resulting into very large target size for filter partitions. Closes https://github.com/facebook/rocksdb/pull/3187 Differential Revision: D6376069 Pulled By: maysamyabandeh fbshipit-source-id: 276546fc68f50e0da32c462abb46f6cf676db9b2	2017-11-20 13:26:57 -08:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Yi Wu	60c09f5fbb	print more table_options to info log Summary: print more table_options to info log Closes https://github.com/facebook/rocksdb/pull/3003 Differential Revision: D6054490 Pulled By: yiwu-arbug fbshipit-source-id: 8e6f96e08bdc906077b6c62ade419d7cb739110f	2017-10-13 14:42:26 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Manuel Ung	88ed1f6ea6	Allow upgrades from nullptr to some merge operator Summary: Currently, RocksDB does not allow reopening a preexisting DB with no merge operator defined, with a merge operator defined. This means that if a DB ever want to add a merge operator, there's no way to do so currently. Fix this by adding a new verification type `kByNameAllowFromNull` which will allow old values to be nullptr, and new values to be non-nullptr. Closes https://github.com/facebook/rocksdb/pull/2958 Differential Revision: D5961131 Pulled By: lth fbshipit-source-id: 06179bebd0d90db3d43690b5eb7345e2d5bab1eb	2017-10-04 09:57:23 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Zhongyi Xie	593d3de371	No need for Restart Interval for meta blocks Summary: In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.) Closes https://github.com/facebook/rocksdb/pull/2940 Differential Revision: D5930139 Pulled By: miasantreble fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21	2017-09-29 20:26:20 -07:00
Maysam Yabandeh	ab0542f5ec	Fix for when block.cache_handle is nullptr Summary: When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account. Closes https://github.com/facebook/rocksdb/pull/2945 Differential Revision: D5937613 Pulled By: maysamyabandeh fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6	2017-09-29 07:56:55 -07:00
Sagar Vemuri	93c2b91740	Introduce conditional merge-operator invocation in point lookups Summary: For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility. This has a couple of advantages: 1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation. 2. It allows to peek at a merge key-value to see if further merge operands need to look at. Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator. Added a new unit test. Made sure that there is no perf regression by running benchmarks. Command line to Load data: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000 ... mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000) ``` ReadRandomMergeRandom bechmark results: Command line: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000 ``` Base -- Without this code change (on commit `fc7476b`): ``` readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` With this code change: ``` readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` Closes https://github.com/facebook/rocksdb/pull/2923 Differential Revision: D5898239 Pulled By: sagar0 fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef	2017-09-28 15:58:49 -07:00
Andrew Kryczka	c2f6e45aa3	prevent nullptr dereference in table reader error case Summary: A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case. Closes https://github.com/facebook/rocksdb/pull/2939 Differential Revision: D5928611 Pulled By: ajkr fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574	2017-09-28 00:12:34 -07:00
Siying Dong	edcbb36944	Three code-level optimization to Iterator::Next() Summary: Three small optimizations: (1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree. (2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads. (3) Move the switching direction logic in MergingIterator::Next() to a separate function. These three in total improves readseq performance by about 3% in my benchmark setting. Closes https://github.com/facebook/rocksdb/pull/2880 Differential Revision: D5829252 Pulled By: siying fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3	2017-09-14 17:57:31 -07:00
Siying Dong	64b6452e0c	Make InternalKeyComparator final and directly use it in merging iterator Summary: Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress. I have to disable the final key word in debug build, as a hack test class depends on overriding the class. Closes https://github.com/facebook/rocksdb/pull/2860 Differential Revision: D5800461 Pulled By: siying fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53	2017-09-11 12:04:21 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Andrew Kryczka	7fbb9eccaf	support disabling checksum in block-based table Summary: store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length. Closes https://github.com/facebook/rocksdb/pull/2781 Differential Revision: D5694702 Pulled By: ajkr fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd	2017-08-23 19:40:47 -07:00
Andrew Kryczka	19cc66dc4f	fix clang bug in block-based table reader Summary: This is the warning that clang considers a bug and has been causing it to fail: ``` table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value' for (; biter.Valid(); biter.Next()) { ^~~~~ ``` Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert. Closes https://github.com/facebook/rocksdb/pull/2779 Differential Revision: D5691225 Pulled By: ajkr fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e	2017-08-23 15:12:05 -07:00
Andrew Kryczka	234f33a3f9	allow nullptr Slice only as sentinel Summary: Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in `39ef900551`, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md. Closes https://github.com/facebook/rocksdb/pull/2777 Differential Revision: D5686016 Pulled By: ajkr fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563	2017-08-23 10:56:06 -07:00
Maysam Yabandeh	1dfcdb15f9	Extend pin_l0 to filter partitions Summary: This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there. Closes https://github.com/facebook/rocksdb/pull/2766 Differential Revision: D5671098 Pulled By: maysamyabandeh fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115	2017-08-23 07:56:08 -07:00
Maysam Yabandeh	1efc600ddf	Preload l0 index partitions Summary: This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set. The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that. Closes https://github.com/facebook/rocksdb/pull/2661 Differential Revision: D5554010 Pulled By: maysamyabandeh fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2	2017-08-18 10:56:20 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Andrew Kryczka	710411aea6	fix asan/valgrind for TableCache cleanup Summary: Breaking commit: `d12691b86f` In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction. This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed. Closes https://github.com/facebook/rocksdb/pull/2662 Differential Revision: D5515108 Pulled By: ajkr fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18	2017-07-27 20:28:04 -07:00
Aaron Gao	8f553d3c52	remove unnecessary internal_comparator param in newIterator Summary: solved https://github.com/facebook/rocksdb/issues/2604 Closes https://github.com/facebook/rocksdb/pull/2648 Differential Revision: D5504875 Pulled By: lightmark fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee	2017-07-27 14:30:42 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	ae28634e9f	Remove some left-over BSD headers Summary: Closes https://github.com/facebook/rocksdb/pull/2608 Differential Revision: D5444797 Pulled By: siying fbshipit-source-id: 690581d03f37822e059a16085088e8e2d8a45016	2017-07-18 11:56:57 -07:00
Sushma Devendrappa	0655b58582	enable PinnableSlice for RowCache Summary: This patch enables using PinnableSlice for RowCache, changes include not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle. Closes https://github.com/facebook/rocksdb/pull/2492 Differential Revision: D5316216 Pulled By: maysamyabandeh fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362	2017-07-17 15:08:30 -07:00
Daniel Black	cbaab30449	table/block.h: change memset Summary: In gcc-7 the following is an error identified by -Werror=class-memaccess In file included from ./table/get_context.h:14:0, from db/version_set.cc:43: ./table/block.h: In constructor ‘rocksdb::BlockReadAmpBitmap::BlockReadAmpBitmap(size_t, size_t, rocksdb::Statistics)’: ./table/block.h:73:53: error: ‘void memset(void, int, size_t)’ clearing an object of type ‘struct std::atomic<unsigned int>’ with no trivial copy-assignment; use value-initialization instead [-Werror=class-memaccess] memset(bitmap_, 0, bitmap_size kBytesPersEntry); ^ In file included from ./db/version_set.h:23:0, from db/version_set.cc:12: /toolchain/include/c++/8.0.0/atomic:684:12: note: ‘struct std::atomic<unsigned int>’ declared here struct atomic<unsigned int> : __atomic_base<unsigned int> ^~~~~~~~~~~~~~~~~~~~ As a solution the default initializer can be applied in list context. Signed-off-by: Daniel Black <daniel.black@au.ibm.com> Closes https://github.com/facebook/rocksdb/pull/2561 Differential Revision: D5398714 Pulled By: siying fbshipit-source-id: d883fb88ec7535eee60d551038fe91f14488be36	2017-07-17 10:41:56 -07:00
Yedidya Feldblum	f1a056e005	CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera Summary: CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera. The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures. Reviewed By: Orvid Differential Revision: D5432398 Tags: codemod, codemod-opensource fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914	2017-07-16 21:26:02 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
奏之章	70440f7a63	Add virtual func IsDeleteRangeSupported Summary: this modify allows third-party tables able to support delete range Closes https://github.com/facebook/rocksdb/pull/2035 Differential Revision: D5407973 Pulled By: ajkr fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418	2017-07-12 16:58:45 -07:00
Daniel Black	7a0b5de771	Gcc 7 ignored quantifiers Summary: The casting seemed to cause a problem. I think this might increase it to unsigned long. Closes https://github.com/facebook/rocksdb/pull/2562 Differential Revision: D5406842 Pulled By: siying fbshipit-source-id: 736adef31448229a58a1a48bdbe77792f36736e8	2017-07-12 09:41:15 -07:00
Maysam Yabandeh	98669b5356	init filters_in_partition_ Summary: Valgrind reports that it is not initialized. Closes https://github.com/facebook/rocksdb/pull/2541 Differential Revision: D5376084 Pulled By: maysamyabandeh fbshipit-source-id: 55c312f4f506863aa0d25ff92c8c34b57f48b860	2017-07-06 08:48:53 -07:00
Maysam Yabandeh	0013bf14ef	fix asan and valgrind leak report in test Summary: Closes https://github.com/facebook/rocksdb/pull/2537 Differential Revision: D5371433 Pulled By: maysamyabandeh fbshipit-source-id: 90d3e8bb1a8576f48b1ddf1bdbba5512b5986ba0	2017-07-05 19:11:39 -07:00
Maysam Yabandeh	f6b9d9355e	Fix clang error in PartitionedFilterBlockBuilder Summary: Closes https://github.com/facebook/rocksdb/pull/2536 Differential Revision: D5371271 Pulled By: maysamyabandeh fbshipit-source-id: f1355ac658a79c9982a24986f0925c9e24fc39d5	2017-07-05 11:16:04 -07:00
Maysam Yabandeh	45b9bb0331	Cut filter partition based on metadata_block_size Summary: Currently metadata_block_size controls only index partition size. With this patch a partition is cut after any of index or filter partitions reaches metadata_block_size. Closes https://github.com/facebook/rocksdb/pull/2452 Differential Revision: D5275651 Pulled By: maysamyabandeh fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160	2017-07-02 10:42:12 -07:00
Mike Kolupaev	397ab11152	Improve Status message for block checksum mismatches Summary: We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages. It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild. Closes https://github.com/facebook/rocksdb/pull/2507 Differential Revision: D5345702 Pulled By: al13n321 fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339	2017-06-28 21:27:01 -07:00
Sagar Vemuri	1cd45cd1b3	FIFO Compaction with TTL Summary: Introducing FIFO compactions with TTL. FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size. To address that request: - Added a new TTL option to FIFO compaction options. - Updated FIFO compaction score to take TTL into consideration. - Added a new table property, creation_time, to keep track of when the SST file is created. - Creation_time is set as below: - On Flush: Set to the time of flush. - On Compaction: Set to the max creation_time of all the files involved in the compaction. - On Repair and Recovery: Set to the time of repair/recovery. - Old files created prior to this code change will have a creation_time of 0. - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day. - FIFO compaction will fall back to the prior way of deleting files based on size if: - the creation_time of all files involved in compaction is 0. - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted. This feature is not supported if max_open_files != -1 or with table formats other than Block-based. Test Plan: Added tests. Benchmark results: Base: FIFO with max size: 100MB :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found) ``` With TTL (a low one for testing) :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20 readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found) ``` Example Log lines: ``` 2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion 2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files ... 2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK 2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40} ``` SST Files remaining in the dbbench dir, after db_bench execution completed: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst ``` Closes https://github.com/facebook/rocksdb/pull/2480 Differential Revision: D5305116 Pulled By: sagar0 fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174	2017-06-27 17:11:48 -07:00
Siying Dong	5c97a7c066	Unit Tests for sync, range sync and file close failures Summary: Closes https://github.com/facebook/rocksdb/pull/2454 Differential Revision: D5255320 Pulled By: siying fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913	2017-06-26 13:27:58 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Maysam Yabandeh	0ac4afb975	Sanitize partitioning options Summary: We currently do not support partitioning filters if indexes are not partitioned. The patch makes sure that these two are consistent. Closes https://github.com/facebook/rocksdb/pull/2455 Differential Revision: D5275644 Pulled By: maysamyabandeh fbshipit-source-id: b61701ac8914c2206d06f5e33ff6f67b24406d1d	2017-06-23 18:30:01 -07:00
Maysam Yabandeh	6f4154d693	record index partition properties Summary: When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated. Closes https://github.com/facebook/rocksdb/pull/2437 Differential Revision: D5224225 Pulled By: maysamyabandeh fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb	2017-06-13 11:21:32 -07:00
Siying Dong	5582123dee	Sample number of reads per SST file Summary: We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne. Closes https://github.com/facebook/rocksdb/pull/2417 Differential Revision: D5193528 Pulled By: siying fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df	2017-06-12 07:12:08 -07:00
Maysam Yabandeh	cc5f9339ee	Fix concurrency issue with filter_block_set_ Summary: filter_block_set_ access must also be protected with mutex. Closes https://github.com/facebook/rocksdb/pull/2413 Differential Revision: D5193159 Pulled By: maysamyabandeh fbshipit-source-id: 6987fc219d9a65c20b9c7e52151aef4b8e4882e6	2017-06-06 12:56:52 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Mike Kolupaev	138b87eae4	Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… Summary: Fixes the following scenario: 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`. 2. Do a compaction. 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`. 4. At some point compaction filter returns `kRemoveAndSkipUntil`. 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded. The fix is to make compaction iterator use `total_order_seek = true`. The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice). Closes https://github.com/facebook/rocksdb/pull/2349 Differential Revision: D5110388 Pulled By: lightmark fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a	2017-06-02 15:11:38 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00
Giuseppe Ottaviano	69ec8356b2	Allow SstFileWriter to use the rate limiter Summary: The default IO priority of WritableFiles is IO_TOTAL, meaning that they will bypass the rate limiter if it's passed in the options. This change allows to pass an io priority in construction, so that by setting IO_LOW or IO_HIGH the rate limit will be honored. It also fixes a minor bug: SstFileWriter's copy and move constructor are not disabled and incorrect, as any copy/move will result in a double free. Switching to unique_ptr makes the object correctly movable and non-copyable as expected. Also fix minor style inconsistencies. Closes https://github.com/facebook/rocksdb/pull/2335 Differential Revision: D5113260 Pulled By: sagar0 fbshipit-source-id: e084236e7ff0b50a56cbeceaa9fedd5e210bf9f8	2017-05-23 11:42:09 -07:00

1 2 3 4 5 ...

875 Commits