rocksdb/table/block_based
Peter Dillinger 8aa99fc71e Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317)
Summary:
With many millions of keys, the old Bloom filter implementation
for the block-based table (format_version <= 4) would have excessive FP
rate due to the limitations of feeding the Bloom filter with a 32-bit hash.
This change computes an estimated inflated FP rate due to this effect
and warns in the log whenever an SST filter is constructed (almost
certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate
due to this effect. The detailed condition is only checked if 3 million
keys or more have been added to a filter, as this should be a lower
bound for common bits/key settings (< 20).

Recommended remedies include smaller SST file size, using
format_version >= 5 (for new Bloom filter), or using partitioned
filters.

This does not change behavior other than generating warnings for some
constructed filters using the old implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317

Test Plan:
Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large)

    $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 | grep 'FP rate'
    [WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters.
    Predicted FP rate %: 0.766702
    Average FP rate %: 0.66846

Example without warning (150K keys):

    $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 | grep 'FP rate'
    Predicted FP rate %: 0.422857
    Average FP rate %: 0.379301
    $

With more samples at 15 bits/key:
  150K keys -> no warning; actual: 0.379% FP rate (baseline)
  1M keys -> no warning; actual: 0.396% FP rate, 1.045x
  9M keys -> no warning; actual: 0.563% FP rate, 1.485x
  10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x
  15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x
  25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x

At 10 bits/key:
  150K keys -> no warning; actual: 1.17% FP rate (baseline)
  1M keys -> no warning; actual: 1.16% FP rate
  10M keys -> no warning; actual: 1.32% FP rate, 1.13x
  25M keys -> no warning; actual: 1.63% FP rate, 1.39x
  35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x

At 5 bits/key:
  150K keys -> no warning; actual: 9.32% FP rate (baseline)
  25M keys -> no warning; actual: 9.62% FP rate, 1.03x
  200M keys -> no warning; actual: 12.2% FP rate, 1.31x
  250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x
  300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x

The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key).

Differential Revision: D19471715

Pulled By: pdillinger

fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68
2020-01-20 21:31:47 -08:00
..
block_based_filter_block_test.cc Prepare filter tests for more implementations (#5967) 2019-10-31 14:12:33 -07:00
block_based_filter_block.cc Fix regression affecting partitioned indexes/filters when cache_index_and_filter_blocks is false (#5705) 2019-08-14 18:16:06 -07:00
block_based_filter_block.h Use delete to disable automatic generated methods. (#5009) 2019-09-11 18:09:00 -07:00
block_based_table_builder.cc Log warning for high bits/key in legacy Bloom filter (#6312) 2020-01-17 19:37:35 -08:00
block_based_table_builder.h Expose and elaborate FilterBuildingContext (#6088) 2019-11-26 18:24:10 -08:00
block_based_table_factory.cc kHashSearch incompatible with index_block_restart_interval>1 (#6294) 2020-01-14 11:21:27 -08:00
block_based_table_factory.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
block_based_table_reader.cc Fix another bug caused by recent hash index fix (#6305) 2020-01-17 01:41:04 -08:00
block_based_table_reader.h Fix kHashSearch bug with SeekForPrev (#6297) 2020-01-15 14:28:39 -08:00
block_builder.cc Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
block_builder.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
block_prefix_index.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
block_prefix_index.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
block_test.cc upgrade gtest 1.7.0 => 1.8.1 for json result writing 2019-09-09 11:24:11 -07:00
block_type.h Make the 'block read count' performance counters consistent (#5484) 2019-06-18 19:03:24 -07:00
block.cc Fix a bug caused by recent fix of Prefix Hash (#6302) 2020-01-16 10:47:20 -08:00
block.h Fix kHashSearch bug with SeekForPrev (#6297) 2020-01-15 14:28:39 -08:00
cachable_entry.h Move the filter readers out of the block cache (#5504) 2019-07-16 13:14:58 -07:00
data_block_footer.cc Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
data_block_footer.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
data_block_hash_index_test.cc Introduce a new storage specific Env API (#5761) 2019-12-13 14:48:41 -08:00
data_block_hash_index.cc Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
data_block_hash_index.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
filter_block_reader_common.cc Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
filter_block_reader_common.h Fix regression affecting partitioned indexes/filters when cache_index_and_filter_blocks is false (#5705) 2019-08-14 18:16:06 -07:00
filter_block.h Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
filter_policy_internal.h Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317) 2020-01-20 21:31:47 -08:00
filter_policy.cc Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317) 2020-01-20 21:31:47 -08:00
flush_block_policy.cc Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
flush_block_policy.h Organizing rocksdb/table directory by format 2019-05-30 14:51:11 -07:00
full_filter_block_test.cc Expose and elaborate FilterBuildingContext (#6088) 2019-11-26 18:24:10 -08:00
full_filter_block.cc Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
full_filter_block.h Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
index_builder.cc kHashSearch incompatible with index_block_restart_interval>1 (#6294) 2020-01-14 11:21:27 -08:00
index_builder.h Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
mock_block_based_table.h Log warning for high bits/key in legacy Bloom filter (#6312) 2020-01-17 19:37:35 -08:00
parsed_full_filter_block.cc Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
parsed_full_filter_block.h New Bloom filter implementation for full and partitioned filters (#6007) 2019-11-13 16:44:01 -08:00
partitioned_filter_block_test.cc Expose and elaborate FilterBuildingContext (#6088) 2019-11-26 18:24:10 -08:00
partitioned_filter_block.cc Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
partitioned_filter_block.h Store the filter bits reader alongside the filter block contents (#5936) 2019-10-18 19:32:59 -07:00
uncompression_dict_reader.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
uncompression_dict_reader.h Revert to storing UncompressionDicts in the cache (#5645) 2019-08-23 08:27:30 -07:00