rocksdb

Author	SHA1	Message	Date
Yanqin Jin	729a617b5b	Add listener to sample file io (#3933 ) Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15	2018-10-12 18:36:11 -07:00
jsteemann	517d3b8b77	fix typo in error message, twice (#4457 ) Summary: Fixes a typo in error messages returned by Iterator::GetProperty(...) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457 Differential Revision: D10281965 Pulled By: sagar0 fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22	2018-10-09 17:07:27 -07:00
Dmitry Alimov	e13d8dcbbb	Fix typos in comments (#4456 ) Summary: Fix some typos in the comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456 Differential Revision: D10209214 Pulled By: miasantreble fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb	2018-10-04 20:46:50 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Maysam Yabandeh	65ac72edd9	Fix bug in partition filters with format_version=4 (#4381 ) Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1	2018-09-17 17:28:15 -07:00
Sagar Vemuri	ac46790374	Fix sync-point comment in Block destructor (#4380 ) Summary: This is a follow up to #4370. The earlier comment is not correct. Thanks to ajkr for pointing this out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4380 Differential Revision: D9874667 Pulled By: sagar0 fbshipit-source-id: f4e092d86b29c715258210b770643d367e38caae	2018-09-17 11:58:11 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Maysam Yabandeh	9ea9007b50	Reduce IndexBlockIter size (#4358 ) Summary: With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size be eliminating active_comparator_ field from the class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4358 Differential Revision: D9781737 Pulled By: maysamyabandeh fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52	2018-09-12 10:03:35 -07:00
Andrew Kryczka	2c14662213	Revert "Digest ZSTD compression dictionary once per SST file (#4251 )" (#4347 ) Summary: Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting. Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347 Differential Revision: D9668365 Pulled By: ajkr fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd	2018-09-06 09:58:34 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Andrew Kryczka	6c40806e51	Digest ZSTD compression dictionary once per SST file (#4251 ) Summary: In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file. ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary. There are a couple other changes included in this PR: - Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it. - Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251 Differential Revision: D9257078 Pulled By: ajkr fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438	2018-08-23 19:28:18 -07:00
Fenggang Wu	da40d45267	DataBlockHashIndex: avoiding expensive iiter->Next when handling hash kNoEntry (#4296 ) Summary: When returning `kNoEntry` from HashIndex lookup, previously we invalidate the `biter` by set `current_=restarts_`, so that the search can continue to the next block in case the search result may reside in the next block. There is one problem: when we are searching for a missing key, if the search finds a `kNoEntry` and continue the search to the next block, there is also a non-trivial possibility that the HashIndex return `kNoEntry` too, and the expensive index iterator `Next()` will happen several times for nothing. The solution is that if the hash table returns `kNoEntry`, `SeekForGetImpl()` just search the last restart interval for the key. It will stop at the first key that is large than the seek_key, or to the end of the block, and each case will be handled correctly. Microbenchmark script: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq,readtocache,readmissing \ --cache_size=20000000000 --use_data_block_hash_index={true\|false} ``` `readmissing` performance (lower is better): ``` binary: 3.6098 micros/op hash (before applying diff): 4.1048 micros/op hash (after applying diff): 3.3502 micros/op ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4296 Differential Revision: D9419159 Pulled By: fgwu fbshipit-source-id: 21e3eedcccbc47a249aa8eb4bf405c9def0b8a05	2018-08-23 10:12:58 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Fenggang Wu	640cfa7c33	DataBlockHashIndex: fix comment in NumRestarts() (#4286 ) Summary: Improve the description of the backward compatibility check in NumRestarts() Pull Request resolved: https://github.com/facebook/rocksdb/pull/4286 Differential Revision: D9412490 Pulled By: fgwu fbshipit-source-id: ea7dd5c61d8ff8eacef623b729d4e4fd53cca066	2018-08-21 17:12:45 -07:00
Fenggang Wu	6d37fdb365	DataBlockHashIndex: Remove the division from EstimateSize() (#4293 ) Summary: `BlockBasedTableBuilder::Add()` eventually calls `DataBlockHashIndexBuilder::EstimateSize()`. The previous implementation divides the `num_keys` by the `util_ratio_` to get an estimizted `num_buckets`. Such division is expensive as it happens in every `BlockBasedTableBuilder::Add()`. This diff estimates the `num_buckets` by double addition instead of double division. Specifically, in each `Add()`, we add `bucket_per_key_` (inverse of `util_ratio_`) to the current `estimiated_num_buckets_`. The cost is that we are gonna have the `estimated_num_buckets_` stored as one extra field in the DataBlockHashIndexBuilder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4293 Differential Revision: D9412472 Pulled By: fgwu fbshipit-source-id: 2925c8509a401e7bd3c1ab1d9e9c7244755c277a	2018-08-20 23:13:50 -07:00
Siying Dong	dc064f302e	Suppress two CLANG Analyze warning (#4291 ) Summary: Suppress two CLANG analyze warnings. They don't seem to be real bugs Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291 Differential Revision: D9407333 Pulled By: siying fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8	2018-08-20 16:57:38 -07:00
Andrey Zagrebin	fa4de6e30f	#3865 followup for fix performance degression introduced by switching order of operands (#4284 ) Summary: Followup for #4266. There is one more place in get_context.cc where MergeOperator::ShouldMerge should be called with reversed list of operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4284 Differential Revision: D9380008 Pulled By: sagar0 fbshipit-source-id: 70ec26e607e5b88465e1acbdcd6c6171bd76b9f2	2018-08-17 10:57:25 -07:00
Fenggang Wu	9d646a6311	Add db_bench options of data block hash index (#4281 ) Summary: Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`. `--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`; `--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281 Differential Revision: D9361476 Pulled By: fgwu fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10	2018-08-16 18:42:46 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.` The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
Siying Dong	f3d91a0b57	Add a unit test to verify iterators release data blocks after using them (#4170 ) Summary: Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170 Differential Revision: D8962513 Pulled By: siying fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407	2018-08-13 17:43:14 -07:00
Maysam Yabandeh	d511f35ea7	Fix wrong partitioned index size recorded in properties block (#4259 ) Summary: After refactoring in https://github.com/facebook/rocksdb/pull/4158 the properties block is written after the index block. This breaks the existing logic in estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available since by the time we write the properties block, the index block is already written. The patch also fixes an issue in estimating the partition size with format_version=3 which was resulting into partitions smaller than the configured metadata_block_size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259 Differential Revision: D9274454 Pulled By: maysamyabandeh fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272	2018-08-10 15:27:20 -07:00
Maysam Yabandeh	058026a885	Fix unity compile error (#4257 ) Summary: Fix the compile error in "make unity_test" caused by #3983. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4257 Differential Revision: D9271740 Pulled By: maysamyabandeh fbshipit-source-id: 94e56d1675bf8bdc0e94439467eb4f40dd107517	2018-08-10 10:27:55 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Maysam Yabandeh	eb8885a08a	Return correct usable_size for BlockContents (#4246 ) Summary: If jemalloc is disabled or the API is incorrectly referenced (jemalloc api on windows have a prefix je_) memory usage is incorrectly reported for all block sizes. This is because sizeof(char) is always 1. sizeof() is calculated at compile time and (char) is char. The patch uses the size of the slice to fix that. Fixes #4245 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4246 Differential Revision: D9233958 Pulled By: maysamyabandeh fbshipit-source-id: 9646933b24504e2814c7379f06a31148829c6b4e	2018-08-08 17:43:00 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
Fenggang Wu	a11df583ec	Add DataBlockIndexType option in BlockBasedTableOptions (#4150 ) Summary: Added DataBlockIndexType option in BlockBasedTableOptions. ``` enum DataBlockIndexType : char { kDataBlockBinarySearch = 0, // traditional block type kDataBlockHashIndex = 1, // additional hash index appended to the end. }; ``` The default type is the traditional binary seek option: `kDataBlockBinarySearch`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150 Differential Revision: D8895958 Pulled By: fgwu fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10	2018-07-27 15:42:27 -07:00
Siying Dong	2a81633da2	Fix bug when seeking backward against an out-of-bound iterator (#4187 ) Summary: `92ee3350e0` introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187 Differential Revision: D8996600 Pulled By: siying fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1	2018-07-25 17:14:01 -07:00
Fenggang Wu	8805ec2f49	DataBlockHashIndex: Standalone Implementation with Unit Test (#4139 ) Summary: The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested. `DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called. `DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139 Reviewed By: sagar0 Differential Revision: D8866764 Pulled By: fgwu fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10	2018-07-24 11:43:37 -07:00
Zhongyi Xie	f95a5b2464	Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490 ) Summary: Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker` Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490 Differential Revision: D6969154 Pulled By: miasantreble fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029	2018-07-20 16:58:13 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Andrew Kryczka	ab35505e21	Write properties metablock last in block-based tables (#4158 ) Summary: The properties meta-block should come at the end since we always need to read it when opening a file, unlike index/filter/other meta-blocks, which are sometimes read depending on the user's configuration. This ordering will allow us to (in a future PR) do a small readahead on the end of the file to read properties and meta-index blocks with one I/O. The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish` function. It was previously too large with inconsistent error handling, which made it difficult to change. So I broke it up into one function per meta-block write, and tried to make error handling consistent within those functions. Then reordering the metablocks was trivial -- just reorder the calls to these helper functions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158 Differential Revision: D8921705 Pulled By: ajkr fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc	2018-07-20 09:11:59 -07:00
Andrew Kryczka	b5613227a9	Smaller tail readahead when not reading index/filters (#4159 ) Summary: In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index. This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159 Differential Revision: D8924002 Pulled By: ajkr fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356	2018-07-19 16:13:22 -07:00
Maysam Yabandeh	b55da012f6	Refactor IndexBlockIter (#4141 ) Summary: Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141 Differential Revision: D8866437 Pulled By: maysamyabandeh fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b	2018-07-16 17:13:10 -07:00
Siying Dong	8f06b4fa01	Separate some IndexBlockIter logic from BlockIter (#4136 ) Summary: Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136 Reviewed By: maysamyabandeh Differential Revision: D8859673 Pulled By: siying fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e	2018-07-16 10:13:18 -07:00
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	2018-07-13 22:43:09 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Maysam Yabandeh	d4ad32d7bd	Refactor BlockIter (#4121 ) Summary: BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121 Differential Revision: D8816832 Pulled By: maysamyabandeh fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197	2018-07-12 17:27:31 -07:00
Nikhil Benesch	5cd8240b86	Test range deletions with more configurations (#4021 ) Summary: Run the basic range deletion tests against the standard set of configurations. This testing exposed that files with hash indexes and partitioned indexes were not handling the case where the file contained only range deletions--i.e., where the index was empty. Additionally file a TODO about the fact that range deletions are broken when allow_mmap_reads = true is set. /cc ajkr nvanbenschoten Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021 Differential Revision: D8811860 Pulled By: ajkr fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343	2018-07-11 15:57:49 -07:00
Yanqin Jin	d4d9fe8e57	Fix a bug caused by not copying the block trailer. (#4096 ) Summary: This was caught by crash test, and the following is a simple way to reproduce it and verify the fix. One way to trigger this code path is to use the following configuration: - Compress SST file - Enable direct IO and prefetch buffer - Do NOT use compressed block cache Closes https://github.com/facebook/rocksdb/pull/4096 Differential Revision: D8742009 Pulled By: riversand963 fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2	2018-07-06 13:12:39 -07:00
Yanqin Jin	39218a72a4	Increase the size of LRU cache. (#4090 ) Summary: Increase the size of each shard so that the number of cache hit/miss match expectation. Otherwise FilterBlockInBlockCache test will fail. Closes https://github.com/facebook/rocksdb/pull/4090 Differential Revision: D8736158 Pulled By: riversand963 fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91	2018-07-05 11:45:11 -07:00
Maysam Yabandeh	2462763b2e	Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077 ) Summary: We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed. Closes https://github.com/facebook/rocksdb/pull/4077 Differential Revision: D8694472 Pulled By: maysamyabandeh fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004	2018-06-29 09:28:12 -07:00
Maysam Yabandeh	29ffbb8a50	Charging block cache more accurately (#4073 ) Summary: Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it. Closes https://github.com/facebook/rocksdb/pull/4073 Differential Revision: D8686552 Pulled By: maysamyabandeh fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f	2018-06-29 08:57:20 -07:00
Zhongyi Xie	14f409c0f1	PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067 ) Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a	2018-06-27 20:42:43 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	2018-06-26 20:27:35 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Sagar Vemuri	189f0c27aa	Make BlockBasedTableIterator compaction-aware (#4048 ) Summary: Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`. In `7103559f49`, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain. Closes https://github.com/facebook/rocksdb/pull/4048 Differential Revision: D8601056 Pulled By: sagar0 fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3	2018-06-25 13:19:27 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Sagar Vemuri	7103559f49	Improve direct IO range scan performance with readahead (#3884 ) Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see 4.5X performance improvement in seekrandom benchmark doing long range scans, when using direct reads, on flash. Description: This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Implementation Details: - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. Constraints: - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. Benchmarks: I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb	2018-06-21 11:13:08 -07:00
Maysam Yabandeh	28a9d8910b	Fix the bug with duplicate prefix in partition filters (#4024 ) Summary: https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut. This bug could result into false negatives, which means that DB could skip an existing key. Closes https://github.com/facebook/rocksdb/pull/4024 Differential Revision: D8518866 Pulled By: maysamyabandeh fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf	2018-06-19 14:12:46 -07:00
Siying Dong	92ee3350e0	BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004 ) Summary: `b555ed30a4` makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81	2018-06-19 09:57:11 -07:00
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	2018-06-15 19:26:58 -07:00
Siying Dong	d82f1421b4	Fix regression bug of Prev() with upper bound (#3989 ) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4	2018-06-12 16:57:36 -07:00
Andrew Kryczka	9d347332fb	Fix argument mismatch in BlockBasedTableBuilder (#3974 ) Summary: The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext`. We were mistakenly passing the `GetContext` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least. Blame: `402b7aa0` Closes https://github.com/facebook/rocksdb/pull/3974 Differential Revision: D8344907 Pulled By: ajkr fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e	2018-06-12 13:57:44 -07:00
Fenggang Wu	3593275357	Remove restart point from the properties_block (#3970 ) Summary: Property block will be read sequentially and cached in a heap located object, so there's no need for restart points. Thus we set the restart interval to infinity to save space. Closes https://github.com/facebook/rocksdb/pull/3970 Differential Revision: D8332586 Pulled By: fgwu fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134	2018-06-12 12:57:37 -07:00
Fenggang Wu	f4502944c3	Change db path for BlockBasedTableTest.BadOptions (#3965 ) Summary: BadOptions test creates a temporary db path changed to table_block_based_bad_options_test to avoid collide with that created by the PrefixAndWholeKeyTest Closes https://github.com/facebook/rocksdb/pull/3965 Differential Revision: D8316080 Pulled By: fgwu fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032	2018-06-08 12:57:14 -07:00
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	2018-06-06 16:58:16 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Mike Kolupaev	812c7371d3	Fix performance regression in Get() for block-based tables (#3953 ) Summary: This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in `8bf555f487` This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments. Closes https://github.com/facebook/rocksdb/pull/3953 Differential Revision: D8270664 Pulled By: al13n321 fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac	2018-06-05 11:43:16 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
Zhongyi Xie	2a0dfaa044	fix PrefixExtractorChanged: pass raw pointer instead shared_ptr Summary: This should resolve the performance regression caused by the unnecessary copying of the shared_ptr. Closes https://github.com/facebook/rocksdb/pull/3937 Differential Revision: D8232330 Pulled By: miasantreble fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97	2018-05-31 21:42:50 -07:00
Maysam Yabandeh	44cf84932f	Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db	2018-05-31 19:28:00 -07:00
Maysam Yabandeh	03cda531e4	Check for rep_->table_properties being nullptr Summary: The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check. Closes https://github.com/facebook/rocksdb/pull/3918 Differential Revision: D8188638 Pulled By: maysamyabandeh fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b	2018-05-29 12:13:55 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Nathan VanBenschoten	8c3bf0801b	Check status when reading HashIndexPrefixesMetadataBlock Summary: This was missed in a refactor of `ReadBlockContents` (`2f1a3a4`). Closes https://github.com/facebook/rocksdb/pull/3906 Differential Revision: D8172648 Pulled By: ajkr fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090	2018-05-25 17:42:51 -07:00
Yanqin Jin	aa53579d6c	Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438	2018-05-25 11:57:51 -07:00
Zhongyi Xie	6c73a46693	Fix a backward compatibility problem with table_properties being nullptr Summary: Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then. Closes https://github.com/facebook/rocksdb/pull/3890 Differential Revision: D8100914 Pulled By: miasantreble fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a	2018-05-22 13:57:17 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Andrew Kryczka	7b655214d2	Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41	2018-05-21 09:57:00 -07:00
Siying Dong	26da3676d9	class Block to store num_restarts_ Summary: Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class. Closes https://github.com/facebook/rocksdb/pull/3869 Differential Revision: D8052493 Pulled By: siying fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1	2018-05-18 12:56:55 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Siying Dong	ddfd2525d2	Make BlockIter final Summary: Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding. Closes https://github.com/facebook/rocksdb/pull/3828 Differential Revision: D7933816 Pulled By: siying fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c	2018-05-09 10:27:26 -07:00
Maysam Yabandeh	fc522bdb3e	Evenly split HarnessTest.Randomized Summary: Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion. Before: ``` [ OK ] HarnessTest.Randomized1n2 (2278 ms) [ OK ] HarnessTest.Randomized3n4 (1095 ms) [ OK ] HarnessTest.Randomized5 (658 ms) [ OK ] HarnessTest.Randomized6 (1258 ms) [ OK ] HarnessTest.Randomized7 (6476 ms) [ OK ] HarnessTest.Randomized8 (8182 ms) ``` After ``` [ OK ] HarnessTest.Randomized1 (2649 ms) [ OK ] HarnessTest.Randomized2 (2645 ms) [ OK ] HarnessTest.Randomized3 (2577 ms) [ OK ] HarnessTest.Randomized4 (2490 ms) [ OK ] HarnessTest.Randomized5 (2553 ms) [ OK ] HarnessTest.Randomized6 (2560 ms) [ OK ] HarnessTest.Randomized7 (2501 ms) [ OK ] HarnessTest.Randomized8 (2574 ms) ``` Closes https://github.com/facebook/rocksdb/pull/3808 Differential Revision: D7882663 Pulled By: maysamyabandeh fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e	2018-05-04 15:28:06 -07:00
Zhongyi Xie	6cab3184f5	avoid double delete on dummy record insertion failure Summary: When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless. Closes https://github.com/facebook/rocksdb/pull/3688 Differential Revision: D7537741 Pulled By: miasantreble fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6	2018-05-01 16:01:28 -07:00
Andrew Kryczka	7004e45489	Remove block-based table assertion for non-empty filter block Summary: `7a6353bd1c` prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the `7a6353bd1c` patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe. Closes https://github.com/facebook/rocksdb/pull/3773 Differential Revision: D7769964 Pulled By: ajkr fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3	2018-04-26 14:43:11 -07:00
Maysam Yabandeh	7e4e381495	Fix the bloom filter skipping empty prefixes Summary: `bc0da4b512` optimized bloom filters by skipping duplicate entires when the whole key and prefixes are both added to the bloom. It however used empty string as the initial value of the last entry added to the bloom. This is incorrect since empty key/prefix are valid entires by themselves. This patch fixes that. Closes https://github.com/facebook/rocksdb/pull/3776 Differential Revision: D7778803 Pulled By: maysamyabandeh fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc	2018-04-26 13:28:31 -07:00
Maysam Yabandeh	bc0da4b512	Skip duplicate bloom keys when whole_key and prefix are mixed Summary: Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate. Closes https://github.com/facebook/rocksdb/pull/3764 Differential Revision: D7744413 Pulled By: maysamyabandeh fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8	2018-04-24 10:58:16 -07:00
Maysam Yabandeh	17e04039dd	Propagate fill_cache config to partitioned index iterator Summary: Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator. Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache. Closes https://github.com/facebook/rocksdb/pull/3739 Differential Revision: D7678308 Pulled By: maysamyabandeh fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da	2018-04-20 15:13:05 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	d2bcd7611f	Fix the memory leak with pinned partitioned filters Summary: The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a	2018-04-09 16:28:19 -07:00
Maysam Yabandeh	67182678a5	Stats for false positive rate of full filtesr Summary: Adds two stats to allow us measuring the false positive rate of full filters: - The total count of positives: rocksdb.bloom.filter.full.positive - The total count of true positives: rocksdb.bloom.filter.full.true.positive Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches. Closes #3680 Tested by: $ ./db_bench -benchmarks=fillrandom -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10 $ ./db_bench -benchmarks="readwhilewriting" -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log $ grep filter.full /tmp/full.log rocksdb.bloom.filter.full.positive COUNT : 3628593 rocksdb.bloom.filter.full.true.positive COUNT : 3536026 which gives the false positive rate of 2.5% Closes https://github.com/facebook/rocksdb/pull/3681 Differential Revision: D7517570 Pulled By: maysamyabandeh fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052	2018-04-05 15:58:48 -07:00
Sagar Vemuri	d687670256	Fix a leak in FilterBlockBuilder when adding prefix Summary: Our valgrind continuous test found an interesting leak which got introduced in #3614. We were adding the prefix key before saving the previous prefix start offset, due to which previous prefix offset is always incorrect. Fixed it by saving the the previous sate before adding the key. Closes https://github.com/facebook/rocksdb/pull/3660 Differential Revision: D7418698 Pulled By: sagar0 fbshipit-source-id: 9933685f943cf2547ed5c553f490035a2fa785cf	2018-03-27 15:13:56 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Dmitri Smirnov	d382ae7de6	Imporve perf of random read and insert compare by suggesting inlining to the compiler Summary: Results from 2015 compiler. This improve sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining. Before: fillseq : 3.638 micros/op 274866 ops/sec; 213.9 MB/s After: fillseq : 3.379 micros/op 295979 ops/sec; 230.3 MB/s Closes https://github.com/facebook/rocksdb/pull/3645 Differential Revision: D7382711 Pulled By: siying fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4	2018-03-23 13:26:55 -07:00
Huachao Huang	7a6353bd1c	Ignore empty filter block when data block is empty Summary: Close https://github.com/facebook/rocksdb/issues/3592 Closes https://github.com/facebook/rocksdb/pull/3614 Differential Revision: D7291706 Pulled By: ajkr fbshipit-source-id: 9dd8f40bd7716588e1e3fd6be0c2bc2766861f8c	2018-03-21 23:13:05 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Siying Dong	b560fc9f62	Fix a block pinning regression introduced in `b555ed30a4` Summary: `b555ed30a4` introduces a regression, which causes blocks always to be pinned in block based iterators. Fix it. Closes https://github.com/facebook/rocksdb/pull/3582 Differential Revision: D7189534 Pulled By: siying fbshipit-source-id: 117dc7a03d0a0e360424db02efb366e12da2be03	2018-03-08 10:12:23 -08:00
Fosco Marotto	d518fe1da6	uint64_t and size_t changes to compile for iOS Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e	2018-03-06 12:43:51 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplifed an assertion - move non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Siying Dong	b555ed30a4	Customized BlockBasedTableIterator and LevelIterator Summary: Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization. The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators. The two-level-iterator now is only used by partitioned index, so it is simplified. Closes https://github.com/facebook/rocksdb/pull/3406 Differential Revision: D6809041 Pulled By: siying fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa	2018-02-12 17:12:25 -08:00
Andrew Kryczka	e78715c29a	Eliminate a memcpy for uncompressed blocks Summary: `ReadBlockFromFile` uses a stack buffer to hold small data blocks before passing them to the compression library, which outputs uncompressed data in a heap buffer. In the case of `kNoCompression` there is a `memcpy` to copy from stack buffer to heap buffer. This PR optimizes `ReadBlockFromFile` to skip the stack buffer for files whose blocks are known to be uncompressed. We determine this using the SST file property, "compression_name", if it's available. Closes https://github.com/facebook/rocksdb/pull/3472 Differential Revision: D6920848 Pulled By: ajkr fbshipit-source-id: 5c753e804efc178b9229ae5dbe6a4adc32031f07	2018-02-07 15:57:37 -08:00
Zhongyi Xie	2f29991701	split RandomizedHarnessTest more ways Summary: RandomizedHarnessTest enumerates different combinations of test type, compression type, restart interval, etc. For some combinations it takes very long to finish, causing the test to time out in test infrastructure. This PR split the test input into smaller trunks in the hope that they will fit in the timeout window. Another possibility is to reduce `num_entries` of course Closes https://github.com/facebook/rocksdb/pull/3467 Differential Revision: D6910235 Pulled By: miasantreble fbshipit-source-id: 717246ee5d21a8a48ad82d4d9c04f9051a66f07f	2018-02-06 13:58:18 -08:00
Andrew Kryczka	1edac32b77	Update rocksdb.read.block.get.micros when block cache disabled Summary: Previously `ReadBlockFromFile` for data blocks was only measured when reading a block to populate block cache. This PR adds the corresponding measurements for users who disabled block cache. Closes https://github.com/facebook/rocksdb/pull/3442 Differential Revision: D6848671 Pulled By: ajkr fbshipit-source-id: bb4bbe1797fa2cc1d9a5bad44891af2b55384b41	2018-01-31 14:26:52 -08:00
Fosco Marotto	77dc069eb9	Change size_t cast in table_test Summary: Fixes this build error on master (macOS): ``` table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] ``` Closes https://github.com/facebook/rocksdb/pull/3434 Reviewed By: maysamyabandeh Differential Revision: D6840354 Pulled By: gfosco fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8	2018-01-30 11:12:51 -08:00
Zhongyi Xie	3fe0937180	Use block cache to track memory usage when ReadOptions.fill_cache=false Summary: ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache. The memory used by the data block is, however, not trackable by users. To make the system more manageable, we can cost the block to block cache while using it, and then release it after using. Closes https://github.com/facebook/rocksdb/pull/3333 Differential Revision: D6670230 Pulled By: miasantreble fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308	2018-01-29 14:43:10 -08:00
Mark Isaacson	b8eb32f8cf	Suppress lint in old files Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c	2018-01-29 12:56:42 -08:00
Maysam Yabandeh	46acdc9883	Split HarnessTest_Randomized to avoid timeout Summary: Split HarnessTest_Randomized to two tests Closes https://github.com/facebook/rocksdb/pull/3424 Differential Revision: D6826006 Pulled By: maysamyabandeh fbshipit-source-id: 59c9a11c7da092206effce6e4fa3792f9c66bef2	2018-01-29 07:41:44 -08:00
Sagar Vemuri	d938226af4	Improve performance of long range scans with readahead Summary: This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Constraints: - The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR). - This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). Thanks to siying for the original idea and implementation. Benchmarks: Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` Page cache was cleared before each experiment with the command: ``` sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" ``` ``` Before: seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found) With this change: seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found) ``` ~3.9X performance improvement. Also verified with strace and gdb that the readahead size is increasing as expected. ``` strace -e readahead -f -T -t -p <db_bench process pid> ``` Closes https://github.com/facebook/rocksdb/pull/3282 Differential Revision: D6586477 Pulled By: sagar0 fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c	2018-01-25 21:41:53 -08:00
Siying Dong	1039133f2d	BlockBasedTable::NewDataBlockIterator to always return BlockIter Summary: This is a pre-cleaning up before a major block based table iterator refactoring. BlockBasedTable::NewDataBlockIterator() will always return BlockIter. This simplifies the logic and code and enable further refactoring and optimization. Closes https://github.com/facebook/rocksdb/pull/3398 Differential Revision: D6780165 Pulled By: siying fbshipit-source-id: 273f7dc896724f682c0118fb69a359d9cc4418b4	2018-01-25 14:57:18 -08:00
Islam AbdelRahman	c615689bb5	Support skipping bloom filters for SstFileWriter Summary: Add an option for SstFileWriter to skip building bloom filters Closes https://github.com/facebook/rocksdb/pull/3360 Differential Revision: D6709120 Pulled By: IslamAbdelRahman fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809	2018-01-22 14:42:18 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Dmitri Smirnov	b010116d82	Eliminate some redundant block reads. Summary: Re-use metadata for reading Compression Dictionary on BlockBased table open, this saves two reads from disk. This helps to our 999 percentile in 5.6.1 where prefetch buffer is not present. Closes https://github.com/facebook/rocksdb/pull/3354 Differential Revision: D6695753 Pulled By: ajkr fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c	2018-01-10 17:11:58 -08:00
Anand Ananthabhotla	199405192d	Add a BlockBasedTableOption to turn off index block compression. Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes https://github.com/facebook/rocksdb/pull/3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054	2018-01-10 15:11:59 -08:00
Siying Dong	ccc095a016	Speed up BlockTest.BlockReadAmpBitmap Summary: BlockTest.BlockReadAmpBitmap is too slow and times out in some environments. Speed it up by: (1) improve the way the verification is done. With this it is 5 times faster (2) run fewer tests for large blocks. This cut it down by another 10 times. Now it can finish in similar time as other tests. Closes https://github.com/facebook/rocksdb/pull/3313 Differential Revision: D6643711 Pulled By: siying fbshipit-source-id: c2397d666eab5421a78ca87e1e45491e0f832a6d	2018-01-02 10:41:28 -08:00
Siying Dong	6b77c07379	NUMBER_BLOCK_COMPRESSED, etc, shouldn't be treated as timer counter Summary: NUMBER_BLOCK_DECOMPRESSED and NUMBER_BLOCK_COMPRESSED are not reported unless the stats level contain detailed timers, which is wrong. They are normal counters. Fix it. Closes https://github.com/facebook/rocksdb/pull/3263 Differential Revision: D6552519 Pulled By: siying fbshipit-source-id: 40899ccea7b2856bb39752616657c0bfd432f6f9	2017-12-14 10:27:43 -08:00
Zhongyi Xie	51c2ea0feb	Reduce heavy hitter for Get operation Summary: This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get` - rocksdb.block.cache.bytes.write - rocksdb.block.cache.add - rocksdb.block.cache.data.miss - rocksdb.block.cache.data.bytes.insert - rocksdb.block.cache.data.add - rocksdb.block.cache.hit - rocksdb.block.cache.data.hit - rocksdb.block.cache.bytes.read The db_bench statistics before and after the change are: \|1GB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|4.22% \|1.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.51% \|0.21% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.14% \|0.14% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| \|1MB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|3.48% \|1.08% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.80% \|0.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.35% \|0.35% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| Closes https://github.com/facebook/rocksdb/pull/3172 Differential Revision: D6330532 Pulled By: miasantreble fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741	2017-12-12 21:11:33 -08:00
Yi Wu	7393ef779c	Fix BlockFetcher ASAN error Summary: Some call sites of BlockFetcher create temporary ReadOptions and pass to BlockFetcher. The temporary object will be gone after BlockFetcher construction but BlockFetcher keep its reference, causing stack-use-after-scope. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3258 Differential Revision: D6547152 Pulled By: yiwu-arbug fbshipit-source-id: 6b49e9dd46bb72307f5d8f88ea15faacff35b9bc	2017-12-12 12:12:38 -08:00
Siying Dong	a9c8d4ef15	Fix memory issue introduced by `2f1a3a4d74` Summary: Closes https://github.com/facebook/rocksdb/pull/3256 Differential Revision: D6541714 Pulled By: siying fbshipit-source-id: 40efd89b68587a9d58cfe6f4eebd771c2d9f1542	2017-12-11 18:27:28 -08:00
Siying Dong	2f1a3a4d74	Refactor ReadBlockContents() Summary: Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher. I hope in general it makes the code easier to maintain. Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future. Closes https://github.com/facebook/rocksdb/pull/3244 Differential Revision: D6520983 Pulled By: siying fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9	2017-12-11 15:27:32 -08:00
Prashant D	baff91c1ad	table: Fix coverity issues Summary: table/block.cc: 420 } CID 1396127 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member restart_offset_ is not initialized in this constructor nor in any functions that it calls. 421} table/block_based_table_builder.cc: CID 1418259 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member compressed_cache_key_prefix_size is not initialized in this constructor nor in any functions that it calls. table/block_based_table_reader.h: 3. uninit_member: Non-static class member index_type is not initialized in this constructor nor in any functions that it calls. CID 1396147 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 5. uninit_member: Non-static class member hash_index_allow_collision is not initialized in this constructor nor in any functions that it calls. 413 global_seqno(kDisableGlobalSequenceNumber) {} 414 table/cuckoo_table_reader.cc: 55 if (hash_funs == user_props.end()) { 56 status_ = Status::Corruption("Number of hash functions not found"); 5. uninit_member: Non-static class member is_last_level_ is not initialized in this constructor nor in any functions that it calls. 7. uninit_member: Non-static class member identity_as_first_hash_ is not initialized in this constructor nor in any functions that it calls. 9. uninit_member: Non-static class member use_module_hash_ is not initialized in this constructor nor in any functions that it calls. 11. uninit_member: Non-static class member num_hash_func_ is not initialized in this constructor nor in any functions that it calls. 13. uninit_member: Non-static class member key_length_ is not initialized in this constructor nor in any functions that it calls. 15. uninit_member: Non-static class member user_key_length_ is not initialized in this constructor nor in any functions that it calls. 17. uninit_member: Non-static class member value_length_ is not initialized in this constructor nor in any functions that it calls. 19. uninit_member: Non-static class member bucket_length_ is not initialized in this constructor nor in any functions that it calls. 21. uninit_member: Non-static class member cuckoo_block_size_ is not initialized in this constructor nor in any functions that it calls. 23. uninit_member: Non-static class member cuckoo_block_bytes_minus_one_ is not initialized in this constructor nor in any functions that it calls. CID 1322785 (#2 of 2): Uninitialized scalar field (UNINIT_CTOR) 25. uninit_member: Non-static class member table_size_ is not initialized in this constructor nor in any functions that it calls. 57 return; table/plain_table_index.h: 2. uninit_member: Non-static class member index_size_ is not initialized in this constructor nor in any functions that it calls. CID 1322801 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 4. uninit_member: Non-static class member sub_index_size_ is not initialized in this constructor nor in any functions that it calls. 128 huge_page_tlb_size_(huge_page_tlb_size) {} 129 Closes https://github.com/facebook/rocksdb/pull/3113 Differential Revision: D6505719 Pulled By: yiwu-arbug fbshipit-source-id: 38f44d8f9dfefb4c2e25d83b8df25a5201c75618	2017-12-07 11:57:36 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Andrew Kryczka	c85f8ccca3	convert null terminator in ascii dump Summary: The ASCII output is almost always useless to me as the first '\0' byte in the key or value causes it to stop printing. Since all characters are already surrounded by spaces, "\ 0" (how we display a backslash followed by a zero) and "\0" (how this PR displays a null terminator) are distinguishable. My assumption is the value of seeing all the bytes outweighs the value of the alignment we had before, where we always had one character followed by one space. Closes https://github.com/facebook/rocksdb/pull/3203 Differential Revision: D6428651 Pulled By: ajkr fbshipit-source-id: aafc978a51e9ea029cfe3e763e2bb0e1751b9ccf	2017-11-28 17:28:58 -08:00
Phani Shekhar Mantripragada	4b65cfc723	Support for block_cache num_shards and other config via option string. Summary: Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M". It doesn't accept other parameters like num_shards etc. Changes : 1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>". Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well. For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings. 2) Corresponding unit tests are added. Closes https://github.com/facebook/rocksdb/pull/3108 Differential Revision: D6420997 Pulled By: sagar0 fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43	2017-11-28 10:48:53 -08:00
Zhongyi Xie	a6c6b8b38c	Revert "No need for Restart Interval for meta blocks" Summary: See [issue 3169](https://github.com/facebook/rocksdb/issues/3169) for more information This reverts commit `593d3de371`. Closes https://github.com/facebook/rocksdb/pull/3188 Differential Revision: D6379271 Pulled By: miasantreble fbshipit-source-id: 88f9ed67ba52237ad9b6f7251db83672b62d7537	2017-11-20 16:42:03 -08:00
Maysam Yabandeh	30285ee31c	Fix calculating filter partition target size Summary: block_size_deviation is in percentage while the partition size is in bytes. The current code fails to take that into account resulting into very large target size for filter partitions. Closes https://github.com/facebook/rocksdb/pull/3187 Differential Revision: D6376069 Pulled By: maysamyabandeh fbshipit-source-id: 276546fc68f50e0da32c462abb46f6cf676db9b2	2017-11-20 13:26:57 -08:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Yi Wu	60c09f5fbb	print more table_options to info log Summary: print more table_options to info log Closes https://github.com/facebook/rocksdb/pull/3003 Differential Revision: D6054490 Pulled By: yiwu-arbug fbshipit-source-id: 8e6f96e08bdc906077b6c62ade419d7cb739110f	2017-10-13 14:42:26 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Manuel Ung	88ed1f6ea6	Allow upgrades from nullptr to some merge operator Summary: Currently, RocksDB does not allow reopening a preexisting DB with no merge operator defined, with a merge operator defined. This means that if a DB ever want to add a merge operator, there's no way to do so currently. Fix this by adding a new verification type `kByNameAllowFromNull` which will allow old values to be nullptr, and new values to be non-nullptr. Closes https://github.com/facebook/rocksdb/pull/2958 Differential Revision: D5961131 Pulled By: lth fbshipit-source-id: 06179bebd0d90db3d43690b5eb7345e2d5bab1eb	2017-10-04 09:57:23 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Zhongyi Xie	593d3de371	No need for Restart Interval for meta blocks Summary: In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.) Closes https://github.com/facebook/rocksdb/pull/2940 Differential Revision: D5930139 Pulled By: miasantreble fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21	2017-09-29 20:26:20 -07:00
Maysam Yabandeh	ab0542f5ec	Fix for when block.cache_handle is nullptr Summary: When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account. Closes https://github.com/facebook/rocksdb/pull/2945 Differential Revision: D5937613 Pulled By: maysamyabandeh fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6	2017-09-29 07:56:55 -07:00
Sagar Vemuri	93c2b91740	Introduce conditional merge-operator invocation in point lookups Summary: For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility. This has a couple of advantages: 1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation. 2. It allows to peek at a merge key-value to see if further merge operands need to look at. Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator. Added a new unit test. Made sure that there is no perf regression by running benchmarks. Command line to Load data: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000 ... mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000) ``` ReadRandomMergeRandom bechmark results: Command line: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000 ``` Base -- Without this code change (on commit `fc7476b`): ``` readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` With this code change: ``` readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` Closes https://github.com/facebook/rocksdb/pull/2923 Differential Revision: D5898239 Pulled By: sagar0 fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef	2017-09-28 15:58:49 -07:00
Andrew Kryczka	c2f6e45aa3	prevent nullptr dereference in table reader error case Summary: A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case. Closes https://github.com/facebook/rocksdb/pull/2939 Differential Revision: D5928611 Pulled By: ajkr fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574	2017-09-28 00:12:34 -07:00
Siying Dong	edcbb36944	Three code-level optimization to Iterator::Next() Summary: Three small optimizations: (1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree. (2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads. (3) Move the switching direction logic in MergingIterator::Next() to a separate function. These three in total improves readseq performance by about 3% in my benchmark setting. Closes https://github.com/facebook/rocksdb/pull/2880 Differential Revision: D5829252 Pulled By: siying fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3	2017-09-14 17:57:31 -07:00
Siying Dong	64b6452e0c	Make InternalKeyComparator final and directly use it in merging iterator Summary: Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress. I have to disable the final key word in debug build, as a hack test class depends on overriding the class. Closes https://github.com/facebook/rocksdb/pull/2860 Differential Revision: D5800461 Pulled By: siying fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53	2017-09-11 12:04:21 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Andrew Kryczka	7fbb9eccaf	support disabling checksum in block-based table Summary: store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length. Closes https://github.com/facebook/rocksdb/pull/2781 Differential Revision: D5694702 Pulled By: ajkr fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd	2017-08-23 19:40:47 -07:00
Andrew Kryczka	19cc66dc4f	fix clang bug in block-based table reader Summary: This is the warning that clang considers a bug and has been causing it to fail: ``` table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value' for (; biter.Valid(); biter.Next()) { ^~~~~ ``` Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert. Closes https://github.com/facebook/rocksdb/pull/2779 Differential Revision: D5691225 Pulled By: ajkr fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e	2017-08-23 15:12:05 -07:00
Andrew Kryczka	234f33a3f9	allow nullptr Slice only as sentinel Summary: Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in `39ef900551`, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md. Closes https://github.com/facebook/rocksdb/pull/2777 Differential Revision: D5686016 Pulled By: ajkr fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563	2017-08-23 10:56:06 -07:00
Maysam Yabandeh	1dfcdb15f9	Extend pin_l0 to filter partitions Summary: This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there. Closes https://github.com/facebook/rocksdb/pull/2766 Differential Revision: D5671098 Pulled By: maysamyabandeh fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115	2017-08-23 07:56:08 -07:00
Maysam Yabandeh	1efc600ddf	Preload l0 index partitions Summary: This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set. The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that. Closes https://github.com/facebook/rocksdb/pull/2661 Differential Revision: D5554010 Pulled By: maysamyabandeh fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2	2017-08-18 10:56:20 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Andrew Kryczka	710411aea6	fix asan/valgrind for TableCache cleanup Summary: Breaking commit: `d12691b86f` In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction. This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed. Closes https://github.com/facebook/rocksdb/pull/2662 Differential Revision: D5515108 Pulled By: ajkr fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18	2017-07-27 20:28:04 -07:00

1 2 3 4 5 ...

902 Commits