rocksdb

Author	SHA1	Message	Date
mrambacher	6bab3a34e9	Move RegisterOptions into the Configurable API (#8223 ) Summary: As previously coded, a Configurable extension would need access to code not in the public API. This change moves RegisterOptions into the Configurable class and therefore available to public extensions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223 Reviewed By: anand1976 Differential Revision: D27960188 Pulled By: mrambacher fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40	2021-04-26 03:13:24 -07:00
Saketh Are	cc1c3ee54e	Eliminate double-buffering of keys in block_based_table_builder (#8219 ) Summary: The block_based_table_builder buffers some blocks in memory to construct a good compression dictionary. Before this commit, the keys from each block were buffered separately for convenience. However, the buffered block data implicitly contains all keys. This commit eliminates the redundant key buffers and reduces memory usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8219 Reviewed By: ajkr Differential Revision: D27945851 Pulled By: saketh-are fbshipit-source-id: caf3cac1217201e080a1e24b542bedf20973afee	2021-04-23 12:45:02 -07:00
Peter Dillinger	95f6add746	Revert Ribbon starting level support from #8198 (#8212 ) Summary: This partially reverts commit `10196d7edc`. The problem with this change is because of important filter use cases: FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so would only use Ribbon filters if specifically including level 0 for the Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate unknown level, and this would be treated the same as level 0 unless fixed. We are keeping the part about committing to permanent schema, which is only changes to API comments and HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D27896468 Pulled By: pdillinger fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e	2021-04-20 19:46:40 -07:00
Peter Dillinger	10196d7edc	Ribbon long-term support, starting level support (#8198 ) Summary: Since the Ribbon filter schema seems good (compatible back to 6.15.0), this change commits to long term support of the SST schema, even though we expect the API for enabling Ribbon to change (still called NewExperimentalRibbonFilterPolicy). This also adds support for "hybrid" configuration in which some levels use Bloom (higher levels, lower numbered) for speed and the rest use Ribbon (lower levels, higher numbered) for memory space efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198 Test Plan: unit test added, crash test support Reviewed By: jay-zhuang Differential Revision: D27831232 Pulled By: pdillinger fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab	2021-04-16 15:43:08 -07:00
Yanqin Jin	09528f9fa1	Fix a bug for SeekForPrev with partitioned filter and prefix (#8137 ) Summary: According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last key in the previous partition" so that SeekForPrev() in prefix mode can return correct result. The prefix of the last key in the previous partition does not necessarily have the same prefix as the first key in the current partition. Regardless of the first key in current partition, the prefix of the last key in the previous partition should be added. The existing code, however, does not follow this. Furthermore, there is another issue: when finishing current filter partition, `FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding the bloom of the prefix of the last key of the previous partition. Prefix extractor is FixedLength.2. ``` [ filter part 1 ] [ filter part 2 ] abc d ``` When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc" but smaller than "d". If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build incorrect iterator tree in non-total-order mode. Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion error when checking on an arbitrary key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137 Test Plan: ``` make check ``` Without this fix, the following command will fail pretty soon. ``` ./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \ --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \ --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \ --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \ --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \ --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \ --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \ --enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \ --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \ --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \ --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \ --max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \ --nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \ --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \ --readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \ --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \ --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \ --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \ --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \ --use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \ --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 --writepercent=35 ``` Reviewed By: pdillinger Differential Revision: D27553054 Pulled By: riversand963 fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98	2021-04-06 12:14:08 -07:00
Andrew Kryczka	c43a37a922	Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141 ) Summary: Return early in case there are zero data blocks when `BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can only be triggered by applying dictionary compression to SST files that contain only range tombstones. It cannot be triggered by a low buffer limit alone since we only consider entering unbuffered mode after buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot be triggered by a totally empty file because those go through `Abandon()` rather than `Finish()` so unbuffered mode is never entered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141 Test Plan: added a unit test that repro'd the "Floating point exception" Reviewed By: riversand963 Differential Revision: D27495640 Pulled By: ajkr fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23	2021-04-01 05:08:17 -07:00
Andrew Kryczka	1ba2b8a568	Add sample_for_compression results to table properties (#8139 ) Summary: Added `TableProperties::{fast,slow}_compression_estimated_data_size`. These properties are present in block-based tables when `ColumnFamilyOptions::sample_for_compression > 0` and the necessary compression library is supported when the file is generated. They contain estimates of what `TableProperties::data_size` would be if the "fast"/"slow" compression library had been used instead. One limitation is we do not record exactly which "fast" (ZSTD or Zlib) or "slow" (LZ4 or Snappy) compression library produced the result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139 Test Plan: - new unit test - ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression Reviewed By: riversand963 Differential Revision: D27454338 Pulled By: ajkr fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4	2021-03-31 18:21:50 -07:00
Andrew Kryczka	c20a7cd6c7	Apply `sample_for_compression` to all block-based tables (#8105 ) Summary: Previously it only applied to block-based tables generated by flush. This restriction was undocumented and blocked a new use case. Now compression sampling applies to all block-based tables we generate when it is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105 Test Plan: new unit test Reviewed By: riversand963 Differential Revision: D27317275 Pulled By: ajkr fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0	2021-03-25 15:00:45 -07:00
Yanqin Jin	0304352882	Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062 ) Summary: When timestamp is enabled, key comparison should take this into account. In `BlockBasedTableReader::Get()`, `BlockBasedTableReader::MultiGet()`, assume the target key is `key`, and the timestamp upper bound is `ts`. The highest key in current block is (key, ts1), while the lowest key in next block is (key, ts2). If ``` ts1 > ts > ts2 ``` then ``` (key, ts1) < (key, ts) < (key, ts2) ``` It can be shown that if `Compare()` is used, then we will mistakenly skip the next block. Instead, we should use `CompareWithoutTimestamp()`. The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc` parameterized so that different index types can be tested. A new unit test is also added for more coverage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27057557 Pulled By: riversand963 fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1	2021-03-15 17:44:52 -07:00
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Peter Dillinger	a8b3b9a20c	Refine Ribbon configuration, improve testing, add Homogeneous (#7879 ) Summary: This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings. This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing). Homogenous Ribbon: This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and whereever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate. Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns. And Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit. Other misc item specifics: * Ribbon APIs in util/ribbon_config.h now provide configuration data for not just 5% construction failure rate (95% success), but also 50% and 0.1%. * Note that the Ribbon structure does not exhibit "threshold" behavior as standard Xor filter does, so there is a roughly fixed space penalty to cut construction failure rate in half. Thus, there isn't really an "almost sure" setting. * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize with a formula have failed. Thus, small data is hard-coded from updated FindOccupancy tool. * Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), thus a more accurate but not perfect reversal of CalculateSpace. (bloom_test updated to expect the greater precision) * Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB * Add some missing 'const' to member functions * Small optimization to 128-bit BitParity * Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon * CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values--because fastrange not mod.) This seems to be as effective or more effective than re-seeding, as long as we increase the number of slots (m) by roughly m += m/w where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.) * 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and not tested with CompactnessAndBacktrackAndFpRate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879 Test Plan: unit test updates included Reviewed By: jay-zhuang Differential Revision: D26371245 Pulled By: pdillinger fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a	2021-02-26 08:50:42 -08:00
xinyuliu	b085ee13e0	Append all characters not captured by xsputn() in overflow() function (#7991 ) Summary: In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previouly only `std::endl` is considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is ``` int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch)); ``` As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow. When I run tests on Windows, I found not only `std::endl` will drop into this case, writing an unsigned long long will also call `os.put()` then followed by `sputc()` and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991 Reviewed By: jay-zhuang Differential Revision: D26615692 Pulled By: ajkr fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468	2021-02-23 21:44:48 -08:00
Akanksha Mahajan	cd79a00903	Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two reads for a table file. The readahead starts at 8KB and doubles on every additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is 256*1024. This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be configured. If max_auto_readahead_size is set 0 then no implicit auto prefetching will be done. If max_auto_readahead_size provided is less than 8KB (which is initial readahead size used by rocksdb in case of auto-readahead), readahead size will remain same as max_auto_readahead_size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951 Test Plan: Add new unit test case. Reviewed By: anand1976 Differential Revision: D26568085 Pulled By: akankshamahajan15 fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af	2021-02-23 16:54:08 -08:00
Andrew Kryczka	daca92c17a	Pick samples for compression dictionary using prime number (#7987 ) Summary: The sample selection technique taken in https://github.com/facebook/rocksdb/issues/7970 was problematic because it had two code paths for sample selection depending on the number of data blocks, and one of those code paths involved an allocation. Using prime numbers, we can consolidate into one code path without allocation. The downside is there will be values of N (number of data blocks buffered) that suffer from poor spread in the selected samples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7987 Test Plan: `make check -j48` Reviewed By: pdillinger Differential Revision: D26586147 Pulled By: ajkr fbshipit-source-id: 62028e54336fadb6e2c7a7fe6747daa05a263d32	2021-02-22 17:43:03 -08:00
Andrew Kryczka	d904233d2f	Limit buffering for collecting samples for compression dictionary (#7970 ) Summary: For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file. However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage. Related changes include: - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970 Test Plan: - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set. Reviewed By: pdillinger Differential Revision: D26467994 Pulled By: ajkr fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465	2021-02-19 14:09:54 -08:00
Ziyue Yang	0c2d71edba	Fix typo: replace readadhead with readahead (#7953 ) Summary: This PR replaces several "readadhead" typos with "readahead". Pull Request resolved: https://github.com/facebook/rocksdb/pull/7953 Reviewed By: ajkr Differential Revision: D26518903 Pulled By: jay-zhuang fbshipit-source-id: 6f7dece0e39ec4f71c4a936399bcb2e02574f42a	2021-02-18 14:31:20 -08:00
Peter Dillinger	e4f1e64c30	Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889 ) Summary: Adds support for prefetching data in Ribbon queries, which especially optimizes batched Ribbon queries for MultiGet (~222ns/key to ~97ns/key) but also single key queries on cold memory (~333ns to ~226ns) because many queries span more than one cache line. This required some refactoring of the query algorithm, and there does not appear to be a noticeable regression in "hot memory" query times (perhaps from 48ns to 50ns). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889 Test Plan: existing unit tests, plus performance validation with filter_bench: Each data point is the best of two runs. I saturated the machine CPUs with other filter_bench runs in the background. Before: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 125.86 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 48.0111 Batched, prepared net ns/op: 222.384 Batched, unprepared net ns/op: 343.908 Skewed 50% in 1% net ns/op: 252.916 Skewed 80% in 20% net ns/op: 320.579 Random filter net ns/op: 332.957 After: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 128.117 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 49.8812 Batched, prepared net ns/op: 97.1514 Batched, unprepared net ns/op: 222.025 Skewed 50% in 1% net ns/op: 197.48 Skewed 80% in 20% net ns/op: 212.457 Random filter net ns/op: 226.464 Bloom comparison, for reference: $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 35.3042 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 Prelim FP rate %: 0.965327 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 9.09931 Batched, prepared net ns/op: 34.21 Batched, unprepared net ns/op: 88.8564 Skewed 50% in 1% net ns/op: 139.75 Skewed 80% in 20% net ns/op: 181.264 Random filter net ns/op: 173.88 Reviewed By: jay-zhuang Differential Revision: D26378710 Pulled By: pdillinger fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69	2021-02-10 21:04:56 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Adam Retter	6e0f62f2b6	Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715 ) Summary: Third batch of adding more tests to ASSERT_STATUS_CHECKED. * db_compaction_filter_test * db_compaction_test * db_dynamic_level_test * db_inplace_update_test * db_sst_test * db_tailing_iter_test * db_io_failure_test Also update GetApproximateSizes APIs to all return Status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715 Reviewed By: jay-zhuang Differential Revision: D25806896 Pulled By: pdillinger fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262	2021-01-06 14:15:02 -08:00
mrambacher	c1a65a4de4	Make StringEnv, StringSink, StringSource use FS classes (#7786 ) Summary: Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes. Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786 Reviewed By: jay-zhuang Differential Revision: D25761460 Pulled By: anand1976 fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f	2021-01-04 16:01:01 -08:00
Akanksha Mahajan	30a5ed9c53	Update "num_data_read" stat in RetrieveMultipleBlocks (#7770 ) Summary: RetrieveMultipleBlocks which is used by MultiGet to read data blocks is not updating num_data_read stat in GetContextStats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7770 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D25538982 Pulled By: akankshamahajan15 fbshipit-source-id: e3daedb035b1be8ab6af6f115cb3793ccc7b1ec6	2020-12-23 15:16:46 -08:00
mrambacher	02418194d7	Add more tests for assert status checked (#7524 ) Summary: Added 10 more tests that pass the ASSERT_STATUS_CHECKED test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524 Reviewed By: akankshamahajan15 Differential Revision: D24323093 Pulled By: ajkr fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801	2020-12-22 23:45:58 -08:00
Peter Dillinger	4d897e51df	Migrate away from Travis+Linux+amd64 (#7791 ) Summary: This disables Linux/amd64 builds in Travis for PRs, and adds a gcc-10+c++20 build in CircleCI, which should fill out sufficient coverage vs. what we had in Travis Fixed a use of std::is_pod, which is deprecated in c++20 Fixed ++ on a volatile in db_repl_stress.cc, with bigger refactoring. Although ++ on this volatile was probably ok with one thread writer and one thread reader, the code was still overly complex. There was a deadcode check for error `if (replThread.no_read < dataPump.no_records)` which can be proven never to happen based on the structure of the code. It infinite loops instead for the case intended to be checked. I just simplified the code for what should be the same checking power. Also most configurations seem to be using make parallelism = 2 * vcores, so fixing / using that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7791 Test Plan: CI and `while ./db_repl_stress; do echo again; done` for a while Reviewed By: siying Differential Revision: D25669834 Pulled By: pdillinger fbshipit-source-id: b2c688053d0b1d52c989903449d3cd27a04130d6	2020-12-22 00:20:57 -08:00
Peter Dillinger	239d17a19c	Support optimize_filters_for_memory for Ribbon filter (#7774 ) Summary: Primarily this change refactors the optimize_filters_for_memory code for Bloom filters, based on malloc_usable_size, to also work for Ribbon filters. This change also replaces the somewhat slow but general BuiltinFilterBitsBuilder::ApproximateNumEntries with implementation-specific versions for Ribbon (new) and Legacy Bloom (based on a recently deleted version). The reason is to emphasize speed in ApproximateNumEntries rather than 100% accuracy. Justification: ApproximateNumEntries (formerly CalculateNumEntry) is only used by RocksDB for range-partitioned filters, called each time we start to construct one. (In theory, it should be possible to reuse the estimate, but the abstractions provided by FilterPolicy don't really make that workable.) But this is only used as a heuristic estimate for hitting a desired partitioned filter size because of alignment to data blocks, which have various numbers of unique keys or prefixes. The two factors lead us to prioritize reasonable speed over 100% accuracy. optimize_filters_for_memory adds extra complication, because precisely calculating num_entries for some allowed number of bytes depends on state with optimize_filters_for_memory enabled. And the allocator-agnostic implementation of optimize_filters_for_memory, using malloc_usable_size, means we would have to actually allocate memory, many times, just to precisely determine how many entries (keys) could be added and stay below some size budget, for the current state. (In a draft, I got this working, and then realized the balance of speed vs. accuracy was all wrong.) So related to that, I have made CalculateSpace, an internal-only API only used for testing, non-authoritative also if optimize_filters_for_memory is enabled. This simplifies some code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774 Test Plan: unit test updated, and for FilterSize test, range of tested values is greatly expanded (still super fast) Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes. Bloom+optimize_filters_for_memory: 1 Filter size: 197 (224 in memory) 134 Filter size: 3525 (3584 in memory) 107 Filter size: 4037 (4096 in memory) Total on disk: 904,506 Total in memory: 918,752 Ribbon+optimize_filters_for_memory: 1 Filter size: 3061 (3072 in memory) 110 Filter size: 3573 (3584 in memory) 58 Filter size: 4085 (4096 in memory) Total on disk: 633,021 (-30.0%) Total in memory: 634,880 (-30.9%) Bloom (no offm): 1 Filter size: 261 (320 in memory) 1 Filter size: 3333 (3584 in memory) 240 Filter size: 3717 (4096 in memory) Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm) Total in memory: 986,944 (+7.4% vs. +offm) Ribbon (no offm): 1 Filter size: 2949 (3072 in memory) 1 Filter size: 3381 (3584 in memory) 167 Filter size: 3701 (4096 in memory) Total on disk: 624,397 (-30.3% vs. Bloom) Total in memory: 690,688 (-30.0% vs. Bloom) Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom. Reviewed By: jay-zhuang Differential Revision: D25592970 Pulled By: pdillinger fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912	2020-12-18 14:31:03 -08:00
Peter Dillinger	003e72b201	Use size_t for filter APIs, protect against overflow (#7726 ) Summary: Deprecate CalculateNumEntry and replace with ApproximateNumEntries (better name) using size_t instead of int and uint32_t, to minimize confusing casts and bad overflow behavior (possible though probably not realistic). Bloom sizes are now explicitly capped at max size supported by implementations: just under 4GiB for fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This hardening could help to set up for fuzzing. Also, since RocksDB only uses this information as an approximation for trying to hit certain sizes for partitioned filters, it's more important that the function be reasonably fast than for it to be completely accurate. It's hard enough to be 100% accurate for Ribbon (currently reversing CalculateSpace) that adding optimize_filters_for_memory into the mix is just not worth trying to be 100% accurate for num entries for bytes. Also: - Cleaned up filter_policy.h to remove MSVC warning handling and potentially unsafe use of exception for "not implemented" - Correct the number of entries limit beyond which current Ribbon implementation falls back on Bloom instead. - Consistently use "num_entries" rather than "num_entry" - Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially obsolete from general implementation BuiltinFilterBitsBuilder::CalculateNumEntries. - Fix filter_bench to skip some tests that don't make sense when only one or a small number of filters has been generated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726 Test Plan: expanded existing unit tests for CalculateSpace / ApproximateNumEntries. Also manually used filter_bench to verify Legacy and fv=5 Bloom size caps work (much too expensive for unit test). Note that the actual bits per key is below requested due to space cap. $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate ... Total size (MB): 511.992 Bits/key stored: 16.777 ... $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000 ... Total size (MB): 4096 Bits/key stored: 17.1799 ... $ Reviewed By: jay-zhuang Differential Revision: D25239800 Pulled By: pdillinger fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4	2020-12-11 22:18:12 -08:00
anand76	8a1488efbf	Ensure that MultiGet works properly with compressed cache (#7756 ) Summary: Ensure that when direct IO is enabled and a compressed block cache is configured, MultiGet inserts compressed data blocks into the compressed block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756 Test Plan: Add unit test to db_basic_test Reviewed By: cheng-chang Differential Revision: D25416240 Pulled By: anand1976 fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f	2020-12-09 17:01:13 -08:00
Jay Zhuang	9e1640403a	Exclude timestamp from prefix extractor (#7668 ) Summary: Timestamp should not be included in prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D24966265 Pulled By: jay-zhuang fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea	2020-12-01 14:07:15 -08:00
Yanqin Jin	84a700819e	Fix the logic of setting read_amp_bytes_per_bit from OPTIONS file (#7680 ) Summary: Instead of using `EncodeFixed32` which always serialize a integer to little endian, we should use the local machine's endianness when populating a native data structure during options parsing. Without this fix, `read_amp_bytes_per_bit` may be populated incorrectly on big-endian machines. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7680 Test Plan: make check Reviewed By: pdillinger Differential Revision: D24999166 Pulled By: riversand963 fbshipit-source-id: dc603cff6e17f8fa32479ce6df93b93082e6b0c4	2020-11-17 00:44:30 -08:00
Yanqin Jin	9aa1b1dc19	Hack to load OPTIONS file for read_amp_bytes_per_bit (#7659 ) Summary: A temporary hack to work around a bug in 6.10, 6.11, 6.12, 6.13 and 6.14. The bug will write out 8 bytes to OPTIONS file from the starting address of BlockBasedTableOptions.read_amp_bytes_per_bit which is actually a uint32. Consequently, the value of read_amp_bytes_per_bit written in the OPTIONS file is wrong. From 6.15, RocksDB will try to parse the read_amp_bytes_per_bit from OPTIONS file as a uint32. To be able to load OPTIONS file generated by affected releases before the fix, we need to manually parse read_amp_bytes_per_bit with this hack. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7659 Test Plan: Generate a db with current 6.14.fb (head at `b6db05dbb5`). Maybe use db_stress. Checkout this PR, run ``` ~/rocksdb/ldb --db=. --try_load_options --ignore_unknown_options idump --count_only ``` Expect success, and should not see ``` Failed: Invalid argument: Error parsing read_amp_bytes_per_bit:17179869184 ``` Also make check Reviewed By: anand1976 Differential Revision: D24954752 Pulled By: riversand963 fbshipit-source-id: c7b802fc3e52acd050a4fc1cd475016122234394	2020-11-13 11:52:50 -08:00
Peter Dillinger	60af964372	Experimental (production candidate) SST schema for Ribbon filter (#7658 ) Summary: Added experimental public API for Ribbon filter: NewExperimentalRibbonFilterPolicy(). This experimental API will take a "Bloom equivalent" bits per key, and configure the Ribbon filter for the same FP rate as Bloom would have but ~30% space savings. (Note: optimize_filters_for_memory is not yet implemented for Ribbon filter. That can be added with no effect on schema.) Internally, the Ribbon filter is configured using a "one_in_fp_rate" value, which is 1 over desired FP rate. For example, use 100 for 1% FP rate. I'm expecting this will be used in the future for configuring Bloom-like filters, as I expect people to more commonly hold constant the filter accuracy and change the space vs. time trade-off, rather than hold constant the space (per key) and change the accuracy vs. time trade-off, though we might make that available. ### Benchmarking ``` $ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 34.1341 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 18.7508 Random filter net ns/op: 258.246 Average FP rate %: 0.968672 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 130.851 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 58.4523 Random filter net ns/op: 363.717 Average FP rate %: 0.952978 ---------------------------- Done. (For more info, run with -legend or -help.) ``` 168.166 / 238.488 = 0.705 -> 29.5% space reduction 130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction) ### Working around a hashing "flaw" bloom_test discovered a flaw in the simple hashing applied in StandardHasher when num_starts == 1 (num_slots == 128), showing an excessively high FP rate. The problem is that when many entries, on the order of number of hash bits or kCoeffBits, are associated with the same start location, the correlation between the CoeffRow and ResultRow (for efficiency) can lead to a solution that is "universal," or nearly so, for entries mapping to that start location. (Normally, variance in start location breaks the effective association between CoeffRow and ResultRow; the same value for CoeffRow is effectively different if start locations are different.) Without kUseSmash and with num_starts > 1 (thus num_starts ~= num_slots), this flaw should be completely irrelevant. Even with 10M slots, the chances of a single slot having just 16 (or more) entries map to it--not enough to cause an FP problem, which would be local to that slot if it happened--is 1 in millions. This spreadsheet formula shows that: =1/(10000000(1 - POISSON(15, 1, TRUE))) As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is intended for CPU efficiency of filters with many more entries/slots than kCoeffBits, a very reasonable work-around is to disallow num_starts==1 when !kUseSmash, by making the minimum non-zero number of slots 2kCoeffBits. This is the work-around I've applied. This also means that the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not space-efficient for less than a few hundred entries. Because of this, I have made it fall back on constructing a Bloom filter, under existing schema, when that is more space efficient for small filters. (We can change this in the future if we want.) TODO: better unit tests for this case in ribbon_test, and probably update StandardHasher for kUseSmash case so that it can scale nicely to small filters. ### Other related changes * Add Ribbon filter to stress/crash test * Add Ribbon filter to filter_bench as -impl=3 * Add option string support, as in "filter_policy=experimental_ribbon:5.678;" where 5.678 is the Bloom equivalent bits per key. * Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom * Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on binary searching CalculateSpace (inefficient), so that subclasses (especially experimental ones) don't have to provide an efficient implementation inverting CalculateSpace. * Minor refactor FastLocalBloomBitsBuilder for new base class XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder, which allows the latter to fall back on Bloom construction in some extreme cases. * Mostly updated bloom_test for Ribbon filter, though a test like FullBloomTest::Schema is a next TODO to ensure schema stability (in case this becomes production-ready schema as it is). * Add some APIs to ribbon_impl.h for configuring Ribbon filters. Although these are reasonably covered by bloom_test, TODO more unit tests in ribbon_test * Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data for constructing the linear approximations in GetNumSlotsFor95PctSuccess. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658 Test Plan: Some unit tests updated but other testing is left TODO. This is considered experimental but laying down schema compatibility as early as possible in case it proves production-quality. Also tested in stress/crash test. Reviewed By: jay-zhuang Differential Revision: D24899349 Pulled By: pdillinger fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8	2020-11-12 20:46:14 -08:00
mrambacher	c442f6809f	Create a Customizable class to load classes and configurations (#6590 ) Summary: The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically. Future PRs will make more types of objects extend Customizable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590 Reviewed By: cheng-chang Differential Revision: D24841553 Pulled By: zhichao-cao fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155	2020-11-11 15:10:41 -08:00
Huisheng Liu	16d103d35b	fix read_amp_bytes_per_bit field size (#7651 ) Summary: The field in BlockBasedTableOptions is 4 bytes: // Default: 0 (disabled) uint32_t read_amp_bytes_per_bit = 0; Pull Request resolved: https://github.com/facebook/rocksdb/pull/7651 Reviewed By: ltamasi Differential Revision: D24844994 Pulled By: riversand963 fbshipit-source-id: e2695e55532256ef8996dd6939cad06987a80293	2020-11-10 11:14:48 -08:00
Jay Zhuang	881e0dcc09	Fix MultiGet unable to query timestamp data issue (#7589 ) Summary: The filter query key should not contain timestamp. The timestamp is stripped for Get(), but not MultiGet(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589 Reviewed By: riversand963 Differential Revision: D24494661 Pulled By: jay-zhuang fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317	2020-11-03 09:45:41 -08:00
Yanqin Jin	394210f280	Remove unused includes (#7604 ) Summary: This is a PR generated semi-automatically by an internal tool to remove unused includes and `using` statements. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604 Test Plan: make check Reviewed By: ajkr Differential Revision: D24579392 Pulled By: riversand963 fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd	2020-10-28 23:22:27 -07:00
Ramkumar Vadivelu	9a690a74e1	In ParseInternalKey(), include corrupt key info in Status (#7515 ) Summary: Fixes Issue https://github.com/facebook/rocksdb/issues/7497 When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()` Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later. Tests: - make check - some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test - some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test Example of new status returns: - Key too small - `Corrupted Key: Internal Key too small. Size=5` - Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3` - Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515 Reviewed By: ajkr Differential Revision: D24240264 Pulled By: ramvadiv fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7	2020-10-28 10:12:58 -07:00
mrambacher	f35f7f2704	Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566 ) Summary: This PR does a few things: 1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation. 2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed: - The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated - The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory). 3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10). I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged. Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently. Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566 Reviewed By: zhichao-cao Differential Revision: D24408980 Pulled By: jay-zhuang fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f	2020-10-27 10:33:09 -07:00
Ziyue Yang	1c78e4b235	Make parallel compression optimization code tidier (#6888 ) Summary: This commit makes https://github.com/facebook/rocksdb/issues/6262's code change tidier and easier to understand by: 1. Wrapping parallel compression initialization and termination into common methods; 2. Wrapping BlockRep initialization, push/pop into common methods; 3. Wrapping file size estimation into common methods; 4. Fixing function declarations that use non-const reference; 5. Fixing some uninitialized variables; 6. Fixing first_block data race; 7. Making BlockRep::status check in BlockBasedTableBuilder::Finish only present if ok(); 8. Making assert(ok()) in BlockBasedTableBuilder::CompressAndVerifyBlock only present in non-parallel compression mode. In parallel compression mode, compression will abort if status is not OK; 9. Eliminating potential data race caused by BlockBasedTableBuilder::GetStatus() and BlockBasedTableBuilder::GetIOStatus() by returning status copy instead of unprotected reference. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6888 Reviewed By: ajkr Differential Revision: D21957110 Pulled By: jay-zhuang fbshipit-source-id: 3a29892f249209513f030349756cecd7736eae80	2020-10-22 11:05:25 -07:00
mrambacher	1eda625eab	Revert `Status`es returned from pre-`Configurable` options functions (#7563 ) Summary: Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563 Reviewed By: zhichao-cao Differential Revision: D24422491 Pulled By: ajkr fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67	2020-10-20 11:53:28 -07:00
Akanksha Mahajan	b4cd51d847	Fix for stress test failure (#7574 ) Summary: Ignore read error in 'FilePrefetchBuffer::TryReadFromCache' as status is ignored and bool value is returned. Return error if prefetch fails in 'PrefetchTail' as we have planned to return Prefetch failures to users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7574 Test Plan: make check -j64, python -u tools/db_crashtest.py blackbox Reviewed By: zhichao-cao Differential Revision: D24408825 Pulled By: akankshamahajan15 fbshipit-source-id: feebda771415998253fbe54632f13e6e75b7a243	2020-10-20 09:13:42 -07:00
anand76	00751e4292	Add a host location property to TableProperties (#7479 ) Summary: This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479 Test Plan: Add new unit tests Reviewed By: pdillinger Differential Revision: D24340671 Pulled By: anand1976 fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2	2020-10-19 11:38:48 -07:00
Akanksha Mahajan	db87afbcb3	Return error if Get/Multi() fails in Prefetching Filter blocks (#7543 ) Summary: Right now all I/O failures under PartitionFilterBlock::CacheDependencies() is swallowed. Return error in case prefetch fails. On returning error in PartitionedFilterBlockReader::CacheDependencies was causing stress test failure because PrefetchBuffer is initialized with enable_ = true, as result when PosixMmapReadableFile::Read is called from Prefetch, scratch is ignored causing buffer to fill with garbage values. Initializing prefetch buffer by CreatePrefetchBuffer that sets enable_ with !ioptions.allow_mmap_reads fixed the problem as it returns without prefetching data if allow_mmap_reads is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7543 Test Plan: make check -j64; python -u tools/db_crashtest.py --simple blackbox Reviewed By: anand1976 Differential Revision: D24284596 Pulled By: akankshamahajan15 fbshipit-source-id: f3f0fd44b59dcf60645730436f28564a07884868	2020-10-14 10:45:36 -07:00
Andrew Kryczka	75d3b6fdf0	Redesign block cache pinning API (#7520 ) Summary: The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520 Test Plan: - new unit test - added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0` Reviewed By: pdillinger Differential Revision: D24200034 Pulled By: ajkr fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e	2020-10-11 14:58:24 -07:00
Akanksha Mahajan	24498ab1ec	Add few unit test cases in ASSERT_STATUS_CHECKED (#7500 ) Summary: Add status enforcement for following tests: 1. import_column_family_test 2. memory_test 3. table_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7500 Reviewed By: zhichao-cao Differential Revision: D24095887 Pulled By: akankshamahajan15 fbshipit-source-id: db8e1ec595852df143fad78a0c07bfdd27dc3c84	2020-10-08 11:22:44 -07:00
Akanksha Mahajan	38d0a365e3	Add Stats for MultiGet (#7366 ) Summary: Add following stats for MultiGet in Histogram to get more insight on MultiGet. 1. Number of index and filter blocks read from file as part of MultiGet request per level. 2. Number of data blocks read from file per level. 3. Number of SST files loaded from file system per level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366 Reviewed By: anand1976 Differential Revision: D24127040 Pulled By: akankshamahajan15 fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b	2020-10-07 13:28:48 -07:00
sdong	2496d3d4b8	Avoid to suppress status code in ~BlockBasedTableBuilder (#7507 ) Summary: We just used a hacky way to fix db_basic_test: suppress status code in ~BlockBasedTableBuilder. Rather, we should pass them back in Finish() and suppress them in Abandon(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7507 Test Plan: Watch existing tests to succeed. Reviewed By: riversand963 Differential Revision: D24119527 fbshipit-source-id: 71c4d4a81c0fd1c5595224692275f20f7759973a	2020-10-05 14:58:49 -07:00
Levi Tamasi	5d16325ce3	Revert "Return error if Get() fails in Prefetching Filter blocks (#7463 )" (#7505 ) Summary: This reverts commit `7d503e66a9`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7505 Reviewed By: ajkr Differential Revision: D24100875 Pulled By: ltamasi fbshipit-source-id: 8705e3e6e8be4b4fd175ffdb031baa6530b61151	2020-10-03 18:38:04 -07:00
Akanksha Mahajan	7d503e66a9	Return error if Get() fails in Prefetching Filter blocks (#7463 ) Summary: Right now all I/O failures under PartitionFilterBlock::CacheDependencies() is swallowed. Return error in case prefetch fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7463 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D24008226 Pulled By: akankshamahajan15 fbshipit-source-id: b65d63b2d01465db92500b78de7ad58650ec9b3b	2020-10-02 19:40:43 -07:00
darionyaphet	2af46f1011	Move break into block (#7468 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7468 Reviewed By: riversand963 Differential Revision: D24028736 Pulled By: ajkr fbshipit-source-id: bd2b4b8d069491a16373d3d2705fddf7ebfe6723	2020-09-30 20:24:23 -07:00
Ramkumar Vadivelu	e04a50923d	Change ParseInternalKey() to return Status instead of bool (#7457 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/7430 Change ParseInternalKey() to return Status instead of bool. db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server): https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0 ![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457 Reviewed By: ajkr Differential Revision: D24002433 Pulled By: ramvadiv fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c	2020-09-30 19:16:47 -07:00
Jay Zhuang	1bdaef7a06	Status check enforcement for timestamp_basic_test (#7454 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7454 Reviewed By: riversand963 Differential Revision: D23981719 Pulled By: jay-zhuang fbshipit-source-id: 01073f73e54c17067b886c4a2f179b2804198399	2020-09-29 18:23:27 -07:00
anand76	bd5d9e2a1d	Fix misspelling of PartitionedIndexIterator (#7450 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7450 Reviewed By: zhichao-cao Differential Revision: D23992811 Pulled By: anand1976 fbshipit-source-id: 71bd898aafce6a3add3c8cd86d9f0e0fb63860cf	2020-09-29 16:28:13 -07:00
sdong	d08a9005b7	Make db_basic_test pass assert status checked (#7452 ) Summary: Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452 Test Plan: See CI tests pass. Reviewed By: zhichao-cao Differential Revision: D23979764 fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe	2020-09-29 09:49:04 -07:00
mrambacher	7d472accdc	Bring the Configurable options together (#5753 ) Summary: This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753 Reviewed By: ajkr Differential Revision: D23385030 Pulled By: zhichao-cao fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a	2020-09-14 17:01:01 -07:00
Jay Zhuang	8a8a01c642	Fix compile error for old gcc-4.8 (#7358 ) Summary: gcc-4.8 returns error when using the constructor. Not sure if it's a compiler bug/limitation or code issue: ``` table/block_based/block_based_table_reader.cc:3183:67: error: use of deleted function ‘rocksdb::WritableFileStringStreamAdapter::WritableFileStringStreamAdapter(rocksdb::WritableFileStringStreamAdapter&&)’ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7358 Reviewed By: pdillinger Differential Revision: D23577651 Pulled By: jay-zhuang fbshipit-source-id: b0197e3d3538da61a6f3866410d88d2047fb9695	2020-09-08 12:09:34 -07:00
Jay Zhuang	27aa443a15	Add sst_file_dumper status check (#7315 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7315 Test Plan: `ASSERT_STATUS_CHECKED=1 make sst_dump_test && ./sst_dump_test` And manually run `./sst_dump --file=*.sst` before and after the change. Reviewed By: pdillinger Differential Revision: D23361669 Pulled By: jay-zhuang fbshipit-source-id: 5bf51a2a90ee35c8c679e5f604732ec2aef5949a	2020-09-04 19:26:42 -07:00
Andrew Kryczka	af54c4092a	fix SstFileWriter with dictionary compression (#7323 ) Summary: In block-based table builder, the cut-over from buffered to unbuffered mode involves sampling the buffered blocks and generating a dictionary. There was a bug where `SstFileWriter` passed zero as the `target_file_size` causing the cutover to happen immediately, so there were no samples available for generating the dictionary. This PR changes the meaning of `target_file_size == 0` to mean buffer the whole file before cutting over. It also adds dictionary compression support to `sst_dump --command=recompress` for easy evaluation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323 Reviewed By: cheng-chang Differential Revision: D23412158 Pulled By: ajkr fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873	2020-09-03 15:49:57 -07:00
Bingyi Sun	61d5a132c9	Fix typo: rename "bounary" to "boundary" in block.cc (#7328 ) Summary: Fix typo in comment for SeekForGetImpl(). Rename "bounary" to "boundary" Pull Request resolved: https://github.com/facebook/rocksdb/pull/7328 Reviewed By: riversand963 Differential Revision: D23439748 Pulled By: zhichao-cao fbshipit-source-id: 83a34c417c71a3210ce54a090d76c4d5571313f3	2020-09-02 20:47:18 -07:00
Jay Zhuang	c2485f2d81	Add buffer prefetch support for non directIO usecase (#7312 ) Summary: A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312 Reviewed By: anand1976 Differential Revision: D23329847 Pulled By: jay-zhuang fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527	2020-08-27 18:16:53 -07:00
sdong	722814e357	Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297 ) Summary: Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297 Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it. Reviewed By: anand1976 Differential Revision: D23257950 fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03	2020-08-25 19:01:05 -07:00
mrambacher	e9befdebbf	Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283 ) Summary: This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check. When it was unclear what the proper behavior was for unchecked status codes, a TODO was added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283 Reviewed By: akankshamahajan15 Differential Revision: D23251497 Pulled By: ajkr fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523	2020-08-20 19:18:35 -07:00
Levi Tamasi	9d6f48ec1d	Clean up CompressBlock/CompressBlockInternal a bit (#7249 ) Summary: The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit. In particular, it does the following: * It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`, where other general compression-related utilities are located. This will facilitate reuse in the BlobDB write path. * The signature of the method is changed so it now takes `compression_format_version` (similarly to the compression library specific methods) instead of `format_version` (which is specific to the block based table). * `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter. This parameter was only used in a (not entirely up-to-date) assertion; also, removing it eliminates the need to ensure this precondition holds at all call sites. * Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass only one of `sampled_output_fast` and `sampled_output_slow`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23087278 Pulled By: ltamasi fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3	2020-08-12 18:25:48 -07:00
anand76	832b056a30	Enable IO timeouts for iterators (#7161 ) Summary: Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for Iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161 Test Plan: New unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22687352 Pulled By: anand1976 fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec	2020-08-07 12:01:08 -07:00
sdong	5c1a544122	Clean up InternalIterator upper bound logic a little bit (#7200 ) Summary: IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200 Test Plan: Run all existing test. Would run crash test with atomic for a while. Reviewed By: anand1976 Differential Revision: D22833181 fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6	2020-08-05 10:44:57 -07:00
sdong	41c328fe57	Fix a perf regression that caused every key to go through upper bound check (#7209 ) Summary: https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound. Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209 Test Plan: Run all existing tests. Will run atomic black box crash test for a while. Reviewed By: anand1976 Differential Revision: D22859246 fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140	2020-08-04 11:30:09 -07:00
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	2020-08-03 15:23:04 -07:00
Andrew Kryczka	643c863b72	minimize BlockIter comparator scope (#7149 ) Summary: PR https://github.com/facebook/rocksdb/issues/6944 transitioned `BlockIter` from using `Comparator` to using concrete `UserComparatorWrapper` and `InternalKeyComparator`. However, adding them as instance variables to `BlockIter` was not optimal. Bloating `BlockIter` caused the `ArenaWrappedDBIter`'s arena allocator to do more heap allocations (in certain cases) which harmed performance of `DB::NewIterator()`. This PR pushes down the concrete comparator objects to the point of usage, which forces them to be on the stack. As a result, the `BlockIter` is back to its original size prior to https://github.com/facebook/rocksdb/issues/6944 (actually a bit smaller since there were two `Comparator` before). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7149 Test Plan: verified our internal `DB::NewIterator()`-heavy regression test no longer reports regression. Reviewed By: riversand963 Differential Revision: D22623189 Pulled By: ajkr fbshipit-source-id: f6d69accfe5de51e0bd9874a480b32b29909bab6	2020-07-20 14:07:04 -07:00
mrambacher	c7c7b07f06	More Makefile Cleanup (#7097 ) Summary: Cleans up some of the dependencies on test code in the Makefile while building tools: - Moves the test::RandomString, DBBaseTest::RandomString into Random - Moves the test::RandomHumanReadableString into Random - Moves the DestroyDir method into file_utils - Moves the SetupSyncPointsToMockDirectIO into sync_point. - Moves the FaultInjection Env and FS classes under env These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated. Tested both release and debug builds via Make and CMake for both static and shared libraries. More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097 Reviewed By: riversand963 Differential Revision: D22463160 Pulled By: pdillinger fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2	2020-07-09 14:35:17 -07:00
Andrew Kryczka	82611ee25a	save key comparisons in BlockIter::BinarySeek (#7068 ) Summary: This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068 Test Plan: ran readrandom with mostly default settings and counted key comparisons using `PerfContext`. before: `user_key_comparison_count = 28881965` after: `user_key_comparison_count = 27823245` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: anand1976 Differential Revision: D22357032 Pulled By: ajkr fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24	2020-07-09 12:27:20 -07:00
Akanksha Mahajan	54f171fe90	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: 1. After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. 2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if seperator_is_key_plus_seq_ is set to true so that subsequent partitions can also use internal key mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096 Test Plan: make check -j64 Reviewed By: ajkr Differential Revision: D22416598 Pulled By: akankshamahajan15 fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac	2020-07-08 21:03:04 -07:00
Andrew Kryczka	dd29ad4223	Separate internal and user key comparators in `BlockIter` (#6944 ) Summary: Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst \| awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` $ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false ``` results: perf improved marginally for ingestion_db and did not change significantly for normal_db: SEEK_NEXT \| DB \| code \| ops/sec \| % change -- \| -- \| -- \| -- \| -- 0 \| normal_db \| master \| 350880 \| 0 \| normal_db \| PR6944 \| 351040 \| 0.0 0 \| ingestion_db \| master \| 343255 \| 0 \| ingestion_db \| PR6944 \| 349424 \| 1.8 10 \| normal_db \| master \| 218711 \| 10 \| normal_db \| PR6944 \| 217892 \| -0.4 10 \| ingestion_db \| master \| 220334 \| 10 \| ingestion_db \| PR6944 \| 226437 \| 2.8 Reviewed By: pdillinger Differential Revision: D21924676 Pulled By: ajkr fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538	2020-07-07 17:26:16 -07:00
Peter Dillinger	a680a7ea37	Un-revert #7049 , revert #7022 (#7071 ) Summary: Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071 Reviewed By: akankshamahajan15 Differential Revision: D22362857 Pulled By: pdillinger fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0	2020-07-02 13:30:41 -07:00
Akanksha Mahajan	5edfe3a3d8	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022 Test Plan: 1. make check -j64 2. Added one unit test case Reviewed By: ajkr Differential Revision: D22197734 Pulled By: akankshamahajan15 fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e	2020-07-01 14:58:08 -07:00
Anand Ananthabhotla	9a5886bd8c	Extend Get/MultiGet deadline support to table open (#6982 ) Summary: Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open. The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future. Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982 Test Plan: Add new unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22219515 Pulled By: anand1976 fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b	2020-06-29 14:53:17 -07:00
sdong	f9817201af	Add unity build to CircleCI (#7026 ) Summary: We are still keeping unity build working. So it's a good idea to add to a pre-commit CI. A latest GCC docker image just to get a little bit more coverage. Fix three small issues to make it pass. Also make unity_test to run db_basic_test rather than db_test to cut the test time. There is no point to run expensive tests here. It was set to run db_test before db_basic_test was separated out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026 Test Plan: watch tests to pass. Reviewed By: zhichao-cao Differential Revision: D22223197 fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4	2020-06-26 11:14:08 -07:00
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427 ) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	2020-06-22 13:32:07 -07:00
Peter Dillinger	25a0d0ca30	Fix block checksum for >=4GB, refactor (#6978 ) Summary: Although RocksDB falls over in various other ways with KVs around 4GB or more, this change fixes how XXH32 and XXH64 were being called by the block checksum code to support >= 4GB in case that should ever happen, or the code copied for other uses. This change is not a schema compatibility issue because the checksum verification code would checksum the first (block_size + 1) mod 2^32 bytes while the checksum construction code would checksum the first block_size mod 2^32 plus the compression type byte, meaning the XXH32/64 checksums for >=4GB block would not match about 255/256 times. While touching this code, I refactored to consolidate redundant implementations, improving diagnostics and performance tracking in some cases. Also used less confusing language in those diagnostics. Makes https://github.com/facebook/rocksdb/issues/6875 obsolete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978 Test Plan: I was able to write a test for this using an SST file writer and VerifyChecksum in a reader. The test fails before the fix, though I'm leaving the test disabled because I don't think it's worth the expense of running regularly. Reviewed By: gg814 Differential Revision: D22143260 Pulled By: pdillinger fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e	2020-06-19 16:18:24 -07:00
sdong	223b57eeb8	Fix the bug that compressed cache is disabled in read-only DBs (#6990 ) Summary: Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990 Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache. Reviewed By: ltamasi Differential Revision: D22072755 fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1	2020-06-17 14:30:54 -07:00
Zitan Chen	94d04529de	Store DB identity and DB session ID in SST files (#6983 ) Summary: `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file. The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`. In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. A table property test is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983 Test Plan: make check and some manual tests. Reviewed By: zhichao-cao Differential Revision: D22048826 Pulled By: gg814 fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9	2020-06-17 10:57:40 -07:00
Zhen Li	9c24a5cb4d	Fix persistent cache on windows (#6932 ) Summary: Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932 Test Plan: 1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed. 2. manually run db_bench.exe with "read_cache_path" and verified. Reviewed By: riversand963 Differential Revision: D21911608 Pulled By: cheng-chang fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2	2020-06-13 13:28:31 -07:00
Andrew Kryczka	e6be168aa5	save a key comparison in block seeks (#6646 ) Summary: This saves up to two key comparisons in block seeks. The first key comparison saved is a redundant key comparison against the restart key where the linear scan starts. This comparison is saved in all cases except when the found key is in the first restart interval. The second key comparison saved is a redundant key comparison against the restart key where the linear scan ends. This is only saved in cases where all keys in the restart interval are less than the target (probability roughly `1/restart_interval`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646 Test Plan: ran a benchmark with mostly default settings and counted key comparisons before: `user_key_comparison_count = 19399529` after: `user_key_comparison_count = 18431498` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: pdillinger Differential Revision: D20849707 Pulled By: ajkr fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac	2020-06-10 13:58:39 -07:00
Andrew Kryczka	02db03af8d	make L0 index/filter pinned memory usage predictable (#6911 ) Summary: Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911 Test Plan: unit test Reviewed By: siying Differential Revision: D21835818 Pulled By: ajkr fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4	2020-06-09 16:51:23 -07:00
Yanqin Jin	3020df9df5	Remove unnecessary inclusion of version_edit.h in env (#6952 ) Summary: In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952 Reviewed By: zhichao-cao Differential Revision: D21925341 Pulled By: riversand963 fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2	2020-06-07 21:56:55 -07:00
anand76	98b0cbea88	Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909 ) Summary: The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller. Tests: Add a test in block_based_table_reader_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909 Reviewed By: pdillinger Differential Revision: D21833922 Pulled By: anand1976 fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae	2020-06-05 11:08:25 -07:00
sdong	afa3518839	Revert "Update googletest from 1.8.1 to 1.10.0 (#6808 )" (#6923 ) Summary: This reverts commit `8d87e9cea1`. Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923 Reviewed By: pdillinger Differential Revision: D21864799 fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc	2020-06-03 15:55:03 -07:00
Peter Dillinger	9360776cb9	Fix handling of too-small filter partition size (#6905 ) Summary: Because ARM and some other platforms have a larger cache line size, they have a larger minimum filter size, which causes recently added PartitionedMultiGet test in db_bloom_filter_test to fail on those platforms. The code would actually end up using larger partitions, because keys_per_partition_ would be 0 and never == number of keys added. The code now attempts to get as close as possible to the small target size, while fully utilizing that filter size, if the target partition size is smaller than the minimum filter size. Also updated the test to break more uniformly across platforms Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905 Test Plan: updated test, tested on ARM Reviewed By: anand1976 Differential Revision: D21840639 Pulled By: pdillinger fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817	2020-06-03 10:43:01 -07:00
Peter Dillinger	14eca6bf04	For ApproximateSizes, pro-rate table metadata size over data blocks (#6784 ) Summary: The implementation of GetApproximateSizes was inconsistent in its treatment of the size of non-data blocks of SST files, sometimes including and sometimes now. This was at its worst with large portion of table file used by filters and querying a small range that crossed a table boundary: the size estimate would include large filter size. It's conceivable that someone might want only to know the size in terms of data blocks, but I believe that's unlikely enough to ignore for now. Similarly, there's no evidence the internal function AppoximateOffsetOf is used for anything other than a one-sided ApproximateSize, so I intend to refactor to remove redundancy in a follow-up commit. So to fix this, GetApproximateSizes (and implementation details ApproximateSize and ApproximateOffsetOf) now consistently include in their returned sizes a portion of table file metadata (incl filters and indexes) based on the size portion of the data blocks in range. In other words, if a key range covers data blocks that are X% by size of all the table's data blocks, returned approximate size is X% of the total file size. It would technically be more accurate to attribute metadata based on number of keys, but that's not computationally efficient with data available and rarely a meaningful difference. Also includes miscellaneous comment improvements / clarifications. Also included is a new approximatesizerandom benchmark for db_bench. No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784 Test Plan: Test added to DBTest.ApproximateSizesFilesWithErrorMargin. Old code running new test... [ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin db/db_test.cc:1562: Failure Expected: (size) <= (11 * 100), actual: 9478 vs 1100 Other tests updated to reflect consistent accounting of metadata. Reviewed By: siying Differential Revision: D21334706 Pulled By: pdillinger fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185	2020-06-02 12:30:23 -07:00
sdong	298b00a396	Reduce dependency on gtest dependency in release code (#6907 ) Summary: Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important reason is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h. Move it to sync_point.h instead. Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D21829884 fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6	2020-06-02 12:11:24 -07:00
Adam Retter	8d87e9cea1	Update googletest from 1.8.1 to 1.10.0 (#6808 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808 Reviewed By: anand1976 Differential Revision: D21483984 Pulled By: pdillinger fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598	2020-06-01 20:33:42 -07:00
anand76	66942e8158	Avoid unnecessary reads of uncompression dictionary in MultiGet (#6906 ) Summary: We may sometimes read the uncompression dictionary when its not necessary, when we lookup a key in an SST file but the index indicates the key is not present. This can happen with index_type 3. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6906 Test Plan: make check Reviewed By: cheng-chang Differential Revision: D21828944 Pulled By: anand1976 fbshipit-source-id: 7aef4f0a39548d0874eafefd2687006d2652f9bb	2020-06-01 19:43:37 -07:00
Cheng Chang	bcb9e41080	Explicitly free allocated buffer when status is not ok (#6903 ) Summary: Currently we rely on `BlockContents` to implicitly free the allocated scratch buffer, but when IO error happens, it doesn't make sense to construct the `BlockContents` which might be corrupted. In the stress test, we find that `assert(req.result.size() == block_size(handle));` fails because of potential IO errors. In this PR, we explicitly free the scratch buffer on error without constructing `BlockContents`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6903 Test Plan: watch stress test Reviewed By: anand1976 Differential Revision: D21823869 Pulled By: cheng-chang fbshipit-source-id: 5603fc80e9bf3f44a9d7250ddebd871afe1eb89f	2020-06-01 15:19:40 -07:00
Andrew Kryczka	c5abf78bca	avoid `IterKey::UpdateInternalKey()` in `BlockIter` (#6843 ) Summary: `IterKey::UpdateInternalKey()` is an error-prone API as it's incompatible with `IterKey::TrimAppend()`, which is used for decoding delta-encoded internal keys. This PR stops using it in `BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s buffer when needed. The logic for safely getting a Slice with global seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`. `BinarySeek()` is also migrated to use this API (previously it ignored global seqno entirely). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst \| awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000 ``` results: \| DB \| code \| throughput \| \|---\|---\|---\| \| normal_db \| master \| 267.9 \| \| normal_db \| PR6843 \| 254.2 (-5.1%) \| \| ingestion_db \| master \| 259.6 \| \| ingestion_db \| PR6843 \| 250.5 (-3.5%) \| Reviewed By: pdillinger Differential Revision: D21562604 Pulled By: ajkr fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264	2020-05-28 10:51:30 -07:00
Cheng Chang	82a82c76e7	Fix potential memory leak of scratch buffer (#6879 ) Summary: If `req.scratch` is an internally allocated buffer, but `raw_block_contents` is not constructed to own `req.scratch`, then `req.scratch` will be leaked. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6879 Test Plan: make asan_check Reviewed By: anand1976 Differential Revision: D21728498 Pulled By: cheng-chang fbshipit-source-id: 8fc6a4f2543918c565ddc16ecfad1807eb9a42cf	2020-05-26 15:29:04 -07:00
Andrew Kryczka	292bcf6227	skip direct I/O tests in rocksdb lite (#6867 ) Summary: Fix a couple places where direct I/O was used even though it is unsupported in lite builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6867 Test Plan: `LITE=1 make check -j48` Reviewed By: pdillinger Differential Revision: D21689185 Pulled By: ajkr fbshipit-source-id: 3eaa3abf69cd7d0bcaabbcad3bb5a26fb8dd7301	2020-05-21 13:57:17 -07:00
mrambacher	38be686160	Add Struct Type to OptionsTypeInfo (#6425 ) Summary: Added code for generically handing structs to OptionTypeInfo. A struct is a collection of variables handled by their own map of OptionTypeInfos. Examples of structs include Compaction and Cache options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425 Reviewed By: siying Differential Revision: D21668789 Pulled By: zhichao-cao fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07	2020-05-21 10:58:39 -07:00
Cheng Chang	91b7553293	Enable IO Uring in MultiGet in direct IO mode (#6815 ) Summary: Currently, in direct IO mode, `MultiGet` retrieves the data blocks one by one instead of in parallel, see `BlockBasedTable::RetrieveMultipleBlocks`. Since direct IO is supported in `RandomAccessFileReader::MultiRead` in https://github.com/facebook/rocksdb/pull/6446, this PR applies `MultiRead` to `MultiGet` so that the data blocks can be retrieved in parallel. Also, in direct IO mode and when data blocks are compressed and need to uncompressed, this PR only allocates one continuous aligned buffer to hold the data blocks, and then directly uncompress the blocks to insert into block cache, there is no longer intermediate copies to scratch buffers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6815 Test Plan: 1. added a new unit test `BlockBasedTableReaderTest::MultiGet`. 2. existing unit tests and stress tests contain tests against `MultiGet` in direct IO mode. Reviewed By: anand1976 Differential Revision: D21426347 Pulled By: cheng-chang fbshipit-source-id: b8446ae0e74152444ef9111e97f8e402ac31b24f	2020-05-14 23:26:26 -07:00
sdong	4a4b8a1344	sst_dump to reduce number of file reads (#6836 ) Summary: sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done: 1. --readahead_size is added, so that users can specify readahead size when scanning the data. 2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too. 3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836 Test Plan: Add a test that covers this new feature. Reviewed By: pdillinger Differential Revision: D21516607 fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e	2020-05-12 18:23:33 -07:00
Ziyue Yang	c384c08a4f	Add tests for compression failure in BlockBasedTableBuilder (#6709 ) Summary: Currently there is no check for whether BlockBasedTableBuilder will expose compression error status if compression fails during the table building. This commit adds fake faulting compressors and a unit test to test such cases. This check finds 5 bugs, and this commit also fixes them: 1. Not handling compression failure well in BlockBasedTableBuilder::BGWorkWriteRawBlock. 2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD. 3. Wrongly passing the same reference of block contents to BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression. 4. Wrongly setting block_rep->first_key_in_next_block to nullptr in BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data blocks. 5. Not maintaining variables for compression ratio estimation and first_block in BlockBasedTableBuilder::EnterUnbuffered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709 Reviewed By: ajkr Differential Revision: D21236254 fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46	2020-05-12 09:27:35 -07:00
Peter Dillinger	b27a1448b6	Fix false NotFound from batched MultiGet with kHashSearch (#6821 ) Summary: The error is assigning KeyContext::s to NotFound status in a table reader for a "not found in this table" case, which skips searching in later tables, like only a delete should. (The hash search index iterator is the only one that can return status NotFound even if Valid() == false.) This was detected by intermittent failure in MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821 Test Plan: modified existing unit test to reproduce problem Reviewed By: anand1976 Differential Revision: D21450469 Pulled By: pdillinger fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c	2020-05-07 15:41:37 -07:00
mrambacher	394f2bbd13	Add OptionTypeInfo::Enum and related methods (#6423 ) Summary: Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType. This change addresses a couple of issues: - It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods) - It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes - It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values - It reduces the number of types and cases that need to be handled in the various OptionType methods Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423 Reviewed By: siying Differential Revision: D21408713 Pulled By: zhichao-cao fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626	2020-05-05 15:04:04 -07:00
sdong	079e50d2ba	Disallow BlockBasedTableBuilder to set status from non-OK (#6776 ) Summary: There is no systematic mechanism to prevent BlockBasedTableBuilder's status to be set from non-OK to OK. Adding a mechanism to force this will help us prevent failures in the future. The solution is to only make it possible to set the status code if the status code to set is not OK. Since the status code passed to CompressAndVerifyBlock() is changed, a mini refactoring is done too so that the output arguments are changed from reference to pointers, based on Google C++ Style. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6776 Test Plan: Run all existing test. Reviewed By: pdillinger Differential Revision: D21314382 fbshipit-source-id: 27000c10f1e4c121661e026548d6882066409375	2020-04-30 15:37:03 -07:00
anand76	ab13d43e1d	Pass a timeout to FileSystem for random reads (#6751 ) Summary: Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded. For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR. Tests: Update existing unit tests to verify the correct timeout value is being passed Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751 Reviewed By: riversand963 Differential Revision: D21285631 Pulled By: anand1976 fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567	2020-04-30 14:50:39 -07:00
mrambacher	618bf638aa	Add Functions to OptionTypeInfo (#6422 ) Summary: Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased. By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422 Test Plan: pass make check Reviewed By: siying Differential Revision: D21269005 Pulled By: zhichao-cao fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077	2020-04-28 18:04:26 -07:00
Peter Dillinger	bae6f58696	Basic MultiGet support for partitioned filters (#6757 ) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f	2020-04-28 14:49:34 -07:00
Peter Dillinger	249eff0f30	Stats for redundant insertions into block cache (#6681 ) Summary: Since read threads do not coordinate on loading data into block cache, two threads between Lookup and Insert can end up loading and inserting the same data. This is particularly concerning with cache_index_and_filter_blocks since those are hot and more likely to be race targets if ejected from (or not pre-populated in) the cache. Particularly with moves toward disaggregated / network storage, the cost of redundant retrieval might be high, and we should at least have some hard statistics from which we can estimate impact. Example with full filter thrashing "cliff": $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 ... $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 14181 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 476 rocksdb.block.cache.data.add COUNT : 12749 rocksdb.block.cache.data.add.redundant COUNT : 18 rocksdb.block.cache.filter.add COUNT : 1003 rocksdb.block.cache.filter.add.redundant COUNT : 217 rocksdb.block.cache.index.add COUNT : 429 rocksdb.block.cache.index.add.redundant COUNT : 241 $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 1182223 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 302728 rocksdb.block.cache.data.add COUNT : 31425 rocksdb.block.cache.data.add.redundant COUNT : 12 rocksdb.block.cache.filter.add COUNT : 795455 rocksdb.block.cache.filter.add.redundant COUNT : 130238 rocksdb.block.cache.index.add COUNT : 355343 rocksdb.block.cache.index.add.redundant COUNT : 172478 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681 Test Plan: Some manual testing (above) and unit test covering key metrics is included Reviewed By: ltamasi Differential Revision: D21134113 Pulled By: pdillinger fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87	2020-04-27 13:20:27 -07:00
anand76	9e7b7e2c08	Silence false alarms in db_stress fault injection (#6741 ) Summary: False alarms are caused by codepaths that intentionally swallow IO errors. Tests: make crash_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/6741 Reviewed By: ltamasi Differential Revision: D21181138 Pulled By: anand1976 fbshipit-source-id: 5ccfbc68eb192033488de6269e59c00f2c65ce00	2020-04-24 13:06:12 -07:00
Ibrahim Jarif	ae77880223	Fix some typos in code comments (#6733 ) Summary: This PR fixes some typos in code comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6733 Reviewed By: siying Differential Revision: D21209037 Pulled By: zhichao-cao fbshipit-source-id: d9274611fab1f5e992998c8c4117b8078c4cbc69	2020-04-23 12:28:49 -07:00
mrambacher	4cbc19d2a1	Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389 ) Summary: The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389 Reviewed By: siying Differential Revision: D21163707 Pulled By: zhichao-cao fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd	2020-04-21 17:38:17 -07:00
Andrew Kryczka	f9155a3404	Prevent uninitialized load in `IndexBlockIter` (#6736 ) Summary: When index block is empty or an error happens while reading it, `Invalidate()` is called rather than `Initialize()`. So `Seek()` must not refer to member variables that are only initialized in `Initialize()` until it is sure `Initialize()` has been called. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6736 Reviewed By: siying Differential Revision: D21139641 Pulled By: ajkr fbshipit-source-id: 71c58cc1adbd795dc3729dd5023bf7df1515ff32	2020-04-20 16:32:43 -07:00
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	2020-04-20 13:24:25 -07:00
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	2020-04-15 17:40:44 -07:00
Ziyue Yang	41563b61db	Fix data racing of BlockBasedTableBuilder::ParallelCompressionRep::first_block (#6640 ) Summary: BlockBasedTableBuilder::ParallelCompressionRep::first_block can be read in Flush() and written in BGWorkWriteRawBlock() concurrently. This commit fixes the issue by reading first_block out before pushing the block to compression and write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6640 Test Plan: Run all tests concurrently with TSAN. Reviewed By: cheng-chang Differential Revision: D20851370 fbshipit-source-id: 6f039222e8319d31e15f1b45e05c106527253f72	2020-04-13 16:24:57 -07:00
Andrew Kryczka	9eca6d651d	fix comparison count for format_version=3 indexes (#6650 ) Summary: In index blocks since `format_version=3`, user keys are written rather than internal keys. When reading such blocks, the comparator is obtained via `InternalKeyComparator::user_comparator()`. That function must not return an unwrapped result as the wrapper class provides accounting logic to populate `PerfContext::user_key_comparison_count`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650 Test Plan: ran db_bench and verified `PerfContext::user_key_comparison_count` became larger. Reviewed By: cheng-chang Differential Revision: D20866325 Pulled By: ajkr fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c	2020-04-13 11:18:37 -07:00
anand76	79c838eb0f	Fix a few bugs in db_stress fault injection (#6693 ) Summary: Fix the following issues - 1. Output parsing error in db_crashtest.py 2. Memory leak on exit 3. False alarm on filter block read error Pull Request resolved: https://github.com/facebook/rocksdb/pull/6693 Test Plan: asan_crash Reviewed By: cheng-chang Differential Revision: D20990399 Pulled By: anand1976 fbshipit-source-id: 178ee0dd7c69a4bc5db698379db0dedb29281699	2020-04-13 11:01:03 -07:00
anand76	5c19a441c4	Fault injection in db_stress (#6538 ) Summary: This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538 Test Plan: crash_test make check Reviewed By: riversand963 Differential Revision: D20714347 Pulled By: anand1976 fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7	2020-04-10 17:21:26 -07:00
anand76	d600e5b0eb	Fix a Centos build failure reported in #6651 (#6656 ) Summary: Fixes issue https://github.com/facebook/rocksdb/issues/6651 Tests: make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6656 Reviewed By: cheng-chang Differential Revision: D20879084 Pulled By: anand1976 fbshipit-source-id: c2cc508ca2716fcf80dcf9d2ba31c32d211f941e	2020-04-10 11:47:46 -07:00
Yi Wu	eb287c72d7	Fix wrong key being read on ingested file with global seqno and delta encoding (#6669 ) Summary: On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key. The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669 Test Plan: New unit test Signed-off-by: Yi Wu <yiwu@pingcap.com> Reviewed By: cheng-chang Differential Revision: D20931808 Pulled By: ajkr fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab	2020-04-08 21:22:15 -07:00
sdong	00f8016b36	Fix clang anaylze warning caused by #6262 (#6641 ) Summary: https://github.com/facebook/rocksdb/pull/6262 causes CLANG analyze to complain. Add assertion to suppress the warning. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6641 Test Plan: Run "clang analyze" and make sure it passes. Reviewed By: anand1976 Differential Revision: D20841722 fbshipit-source-id: 5fa6e0c5cfe7a822214c9b898a408df59d4fd2cd	2020-04-03 15:47:51 -07:00
mrambacher	259b6ec8da	Move the OptionTypeMap code closer to home (#6198 ) Summary: This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used. When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198 Reviewed By: siying Differential Revision: D20778108 Pulled By: zhichao-cao fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa	2020-04-03 10:52:38 -07:00
sdong	d0f3894cf1	In block based table builder, make variables for estimating file size atomic (#6636 ) Summary: With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636 Test Plan: Run all tests with TSAN. Reviewed By: anand1976 Differential Revision: D20820635 fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5	2020-04-02 16:16:24 -07:00
Ziyue Yang	8088482dd6	Fix a division by zero after #6262 (#6633 ) Summary: With https://github.com/facebook/rocksdb/issues/6262, UBSAN fails with "division by zero": [ RUN ] Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/3 internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066:39: runtime error: division by zero #0 0x7ffb3117b071 in rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066 https://github.com/facebook/rocksdb/issues/1 0x7ffb311775e1 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:848 https://github.com/facebook/rocksdb/issues/2 0x7ffb311771a2 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder, rocksdb::BlockHandle, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:832 This is caused by not returning immediately after CompressAndVerifyBlock call in WriteBlock when rep_->status == kBuffered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6633 Test Plan: Run all existing test. Reviewed By: anand1976 Differential Revision: D20808366 fbshipit-source-id: 09f24b7c0fbaf4c7a8fc48cac61fa6fcb9b85811	2020-04-02 11:57:05 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
Levi Tamasi	e6f86cfb36	Revert the recent cache deleter change (#6620 ) Summary: Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)" This reverts commit `6301dbe7a7`. Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)" This reverts commit `3a35542f86`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620 Test Plan: `make check` Reviewed By: zhichao-cao Differential Revision: D20773311 Pulled By: ltamasi fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b	2020-03-31 16:11:06 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
Levi Tamasi	6301dbe7a7	Use function objects as deleters in the block cache (#6545 ) Summary: As the first step of reintroducing eviction statistics for the block cache, the patch switches from using simple function pointers as deleters to function objects implementing an interface. This will enable using deleters that have state, like a smart pointer to the statistics object that is to be updated when an entry is removed from the cache. For now, the patch adds a deleter template class `SimpleDeleter`, which simply casts the `value` pointer to its original type and calls `delete` or `delete[]` on it as appropriate. Note: to prevent object lifecycle issues, deleters must outlive the cache entries referring to them; `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton pattern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545 Test Plan: `make asan_check` Reviewed By: siying Differential Revision: D20475823 Pulled By: ltamasi fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a	2020-03-26 16:19:58 -07:00
Mike Kolupaev	963af52f15	Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562 ) Summary: We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace: ``` rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice, char, bool) const rocksdb::BlockFetcher::ReadBlockContents() rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::BlockContents) const rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, bool, bool) const rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const, rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) const rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext) const rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const, rocksdb::Slice const&, rocksdb::SliceTransform const, rocksdb::Comparator const, rocksdb::Slice const, bool, bool, rocksdb::BlockCacheLookupContext) rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const, bool, rocksdb::BlockCacheLookupContext) const rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const) rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool) rocksdb::DBIter::Seek(rocksdb::Slice const&) ``` `BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562 Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing. Reviewed By: siying Differential Revision: D20529368 Pulled By: al13n321 fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611	2020-03-26 15:21:26 -07:00
Cheng Chang	4fc216649d	Support direct IO in RandomAccessFileReader::MultiRead (#6446 ) Summary: By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined. In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API. For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446 Test Plan: Added a new test `random_access_file_reader_test.cc`. Reviewed By: anand1976 Differential Revision: D20097518 Pulled By: cheng-chang fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1	2020-03-20 16:33:26 -07:00
sdong	712bc4b6a2	Fix regression bug in partitioned index reseek caused by #6531 (#6551 ) Summary: https://github.com/facebook/rocksdb/pull/6531 removed some code in partitioned index seek logic. By mistake the logic of storing previous index offset is removed, while the logic of using it is preserved, so that the code might use wrong value to determine reseeking condition. This will trigger a bug, if following a Seek() not going to the last block, SeekToLast() is called, and then Seek() is called which should position the cursor to the block before SeekToLast(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6551 Test Plan: Add a unit test that reproduces the bug. In the same unit test, also some reseek cases are covered to avoid regression. Reviewed By: pdillinger Differential Revision: D20493990 fbshipit-source-id: 3919aa4861c0481ec96844e053048da1a934b91d	2020-03-17 12:33:10 -07:00
Yanqin Jin	098dce2d1a	Fix compiler warning treated as error (#6547 ) Summary: Define a private member variable only in debug mode. Without fix, build will fail ``` In file included from table/block_based/partitioned_index_iterator.cc:9: ./table/block_based/partitioned_index_iterator.h:125:32: error: private field 'icomp_' is not used [-Werror,-Wunused-private-field] const InternalKeyComparator& icomp_; ``` Test plan (dev server) 1. make check 2. Make sure fixed in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6547 Reviewed By: siying Differential Revision: D20480027 Pulled By: pdillinger fbshipit-source-id: 288bc94280e240c3136335b6c73eb1ccb0db459d	2020-03-17 09:59:28 -07:00
sdong	d66908091d	De-template block based table iterator (#6531 ) Summary: Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done: 1. Copy the block based iterator code into partitioned index iterator, and de-template them. 2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks. 3. Separate out the prefetch logic to a helper class and both classes call them. This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531 Test Plan: build using make and cmake. And build release Differential Revision: D20473108 fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f	2020-03-16 12:20:50 -07:00
sdong	674cf41732	Divide block_based_table_reader.cc (#6527 ) Summary: block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide the files to multiple files. Some class templates cannot be moved to .cc file. They are moved to .h files. It is still better than including them all in block_based_table_reader.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527 Test Plan: "make all check" and "make release". Also build using cmake. Differential Revision: D20428455 fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28	2020-03-12 21:41:50 -07:00
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (2% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	2020-03-06 16:24:27 -08:00
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	2020-02-28 14:14:03 -08:00
Andrew Kryczka	69679e7375	Fix range deletion tombstone ingestion with global seqno (#6429 ) Summary: Original author: jeffrey-xiao If we are writing a global seqno for an ingested file, the range tombstone metablock gets accessed and put into the cache during ingestion preparation. At the time, the global seqno of the ingested file has not yet been determined, so the cached block will not have a global seqno. When the file is ingested and we read its range tombstone metablock, it will be returned from the cache with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all zero, so the tombstones cover nothing. This commit removes global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of storing this mutable attribute in Block. Additionally, this commit adds a regression test to check that keys are deleted when ingesting a file with a global seqno and range deletion tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429 Differential Revision: D19961563 Pulled By: ajkr fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7	2020-02-25 15:31:48 -08:00
Yanqin Jin	890d87fadc	Some minor fix-ups (#6440 ) Summary: Cleanup some code without any real change in functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440 Differential Revision: D20015891 Pulled By: riversand963 fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a	2020-02-21 15:09:56 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
anand76	3e49249d30	Ensure all MultiGet IO errors are propagated to user (#6403 ) Summary: Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403 Test Plan: Unit tests Tried running stress tests but couldn't repro the stress failure Differential Revision: D19846721 Pulled By: anand1976 fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4	2020-02-11 17:27:22 -08:00
anand76	35ed530d2c	Revert "Check KeyContext status in MultiGet (#6387 )" (#6401 ) Summary: This reverts commit `d70011bccc`. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401 Differential Revision: D19826623 Pulled By: anand1976 fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1	2020-02-10 22:23:36 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
anand76	d70011bccc	Check KeyContext status in MultiGet (#6387 ) Summary: Currently, any IO errors and checksum mismatches while reading data blocks, are being ignored by the batched MultiGet. Its only looking at the GetContext state. Fix that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387 Test Plan: Add unit tests Differential Revision: D19799819 Pulled By: anand1976 fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969	2020-02-07 16:48:16 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
sdong	f10f135938	Fix regression bug of hash index with iterator total order seek (#6328 ) Summary: https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash. Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked Also, a null pointer check is added so that a bug like this won't cause segfault in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328 Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass. Differential Revision: D19586751 fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42	2020-01-27 15:44:54 -08:00
Peter Dillinger	986df37135	Clean up PartitionedFilterBlockBuilder (#6299 ) Summary: Remove the redundant PartitionedFilterBlockBuilder::num_added_ and ::NumAdded since the parent class, FullFilterBlockBuilder, already provides them. Also rename filters_in_partition_ and filters_per_partition_ to keys_added_to_partition_ and keys_per_partition_ to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6299 Test Plan: make check Differential Revision: D19413278 Pulled By: pdillinger fbshipit-source-id: 04926ee7874477d659cb2b6ae03f2d995fb747e5	2020-01-27 13:15:14 -08:00
Peter Dillinger	8aa99fc71e	Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317 ) Summary: With many millions of keys, the old Bloom filter implementation for the block-based table (format_version <= 4) would have excessive FP rate due to the limitations of feeding the Bloom filter with a 32-bit hash. This change computes an estimated inflated FP rate due to this effect and warns in the log whenever an SST filter is constructed (almost certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate due to this effect. The detailed condition is only checked if 3 million keys or more have been added to a filter, as this should be a lower bound for common bits/key settings (< 20). Recommended remedies include smaller SST file size, using format_version >= 5 (for new Bloom filter), or using partitioned filters. This does not change behavior other than generating warnings for some constructed filters using the old implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317 Test Plan: Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large) $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 \| grep 'FP rate' [WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters. Predicted FP rate %: 0.766702 Average FP rate %: 0.66846 Example without warning (150K keys): $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 \| grep 'FP rate' Predicted FP rate %: 0.422857 Average FP rate %: 0.379301 $ With more samples at 15 bits/key: 150K keys -> no warning; actual: 0.379% FP rate (baseline) 1M keys -> no warning; actual: 0.396% FP rate, 1.045x 9M keys -> no warning; actual: 0.563% FP rate, 1.485x 10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x 15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x 25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x At 10 bits/key: 150K keys -> no warning; actual: 1.17% FP rate (baseline) 1M keys -> no warning; actual: 1.16% FP rate 10M keys -> no warning; actual: 1.32% FP rate, 1.13x 25M keys -> no warning; actual: 1.63% FP rate, 1.39x 35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x At 5 bits/key: 150K keys -> no warning; actual: 9.32% FP rate (baseline) 25M keys -> no warning; actual: 9.62% FP rate, 1.03x 200M keys -> no warning; actual: 12.2% FP rate, 1.31x 250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x 300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key). Differential Revision: D19471715 Pulled By: pdillinger fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68	2020-01-20 21:31:47 -08:00
Peter Dillinger	4b86fe1123	Log warning for high bits/key in legacy Bloom filter (#6312 ) Summary: Help users that would benefit most from new Bloom filter implementation by logging a warning that recommends the using format_version >= 5. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6312 Test Plan: $ (for BPK in 10 13 14 19 20 50; do ./filter_bench -quick -impl=0 -bits_per_key=$BPK -m_queries=1 2>&1; done) \| grep 'its/key' Bits/key actual: 10.0647 Bits/key actual: 13.0593 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (14) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 14.0581 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (19) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 19.0542 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 20.0584 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (50) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 50.0577 Differential Revision: D19457191 Pulled By: pdillinger fbshipit-source-id: 073d94cde5c70e03a160f953e1100c15ea83eda4	2020-01-17 19:37:35 -08:00
sdong	d87cffaea4	Fix another bug caused by recent hash index fix (#6305 ) Summary: Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305 Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0 Differential Revision: D19438925 fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217	2020-01-17 01:41:04 -08:00
sdong	f8b5ef85ec	Fix a bug caused by recent fix of Prefix Hash (#6302 ) Summary: Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302 Test Plan: Add a unit test that would fail. Not fix. crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure. Differential Revision: D19424572 fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566	2020-01-16 10:47:20 -08:00
sdong	d2b4d42d4b	Fix kHashSearch bug with SeekForPrev (#6297 ) Summary: When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target. Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297 Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix. Differential Revision: D19404695 fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3	2020-01-15 14:28:39 -08:00
Maysam Yabandeh	d4b7fbf0d5	kHashSearch incompatible with index_block_restart_interval>1 (#6294 ) Summary: kHashSearch index type is incompatible with index_block_restart_interval larger than 1. The patch asserts that and also resets index_block_restart_interval value if it is incompatible with kHashSearch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6294 Differential Revision: D19394229 Pulled By: maysamyabandeh fbshipit-source-id: 8a12712ab25e81094a7f71ecd43f773dd4fb6acd	2020-01-14 11:21:27 -08:00
sdong	f295b099f6	BlockBasedTable::ApproximateSize() should use total order seek (#6222 ) Summary: Right now BlockBasedTable::ApproximateSize() uses default setting about whether to use total order seek. There is no reason for that. There is no reason to do any filtering for approximate size boundary key, and it may introduce bugs. Disable it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6222 Test Plan: Run existing tests Differential Revision: D19184787 fbshipit-source-id: 64180660bd2800914fff75104172b61c06f0b1c9	2019-12-19 14:56:38 -08:00

1 2 3 4 5 ...

328 Commits