rocksdb

Peter Dillinger 5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)

Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV, as heavily accessed filters might
happen to be among the less accurate ones.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With jemalloc, we see about a 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but a 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).
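
The non-linearity mentioned above can be illustrated with a small sketch. It uses the textbook false-positive estimate for a standard Bloom filter, which is only an approximation of the format_version=5 implementation; the function names and the probe count of 6 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Textbook estimate for a standard Bloom filter: (1 - e^(-k/b))^k, where b is
// bits per key and k is the number of probes. This is an approximation and
// not the exact model of the format_version=5 filter.
inline double StandardBloomFpRate(double bits_per_key, int num_probes) {
  assert(bits_per_key > 0.0 && num_probes > 0);
  return std::pow(1.0 - std::exp(-num_probes / bits_per_key), num_probes);
}

// Average FP rate of two equally queried filters built at different
// bits/key (hypothetical helper, for comparison with a uniform setting).
inline double MixedFpRate(double low_bits, double high_bits, int num_probes) {
  return 0.5 * (StandardBloomFpRate(low_bits, num_probes) +
                StandardBloomFpRate(high_bits, num_probes));
}
```

Under this model, MixedFpRate(8.0, 12.0, 6) comes out near 1.3%, while StandardBloomFpRate(10.0, 6) is near 0.84%, although both spend 10 bits/key on average. Mixing rounded-down and rounded-up filters therefore needs slightly more bits overall to reach the same average FP rate.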

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.
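
The state-tracking idea described above might be sketched roughly as follows. This is a simplified illustration, not the code from this change: FakeUsableSize is a hypothetical stand-in for malloc_usable_size, the FP estimate is the textbook formula for a standard Bloom filter, and the class name, the candidate fractions, and the probe count are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for malloc_usable_size(): pretends the allocator
// rounds every request up to a multiple of 4096 bytes. Real size classes
// are finer and uneven.
inline std::size_t FakeUsableSize(std::size_t requested) {
  const std::size_t kClass = 4096;
  return (requested + kClass - 1) / kClass * kClass;
}

// Textbook FP estimate for a standard Bloom filter (an approximation).
inline double EstimateFpRate(std::size_t num_keys, std::size_t bytes,
                             int num_probes) {
  const double bits_per_key = 8.0 * static_cast<double>(bytes) / num_keys;
  return std::pow(1.0 - std::exp(-num_probes / bits_per_key), num_probes);
}

struct BuiltFilter {
  std::size_t used_bytes;  // the whole usable size is given to the filter
  double fp_rate;          // estimated FP rate at that size
};

class RoundingBalancer {
 public:
  RoundingBalancer(double bits_per_key, int num_probes)
      : bits_per_key_(bits_per_key), num_probes_(num_probes) {}

  BuiltFilter Build(std::size_t num_keys) {
    assert(num_keys > 0);
    const std::size_t target =
        static_cast<std::size_t>(std::ceil(num_keys * bits_per_key_ / 8.0));
    const double target_fp = EstimateFpRate(num_keys, target, num_probes_);
    std::size_t request = target;
    if (balance_ < 0.0) {
      // Earlier filters came out more accurate than the target on net, so
      // spend that surplus by trying a few smaller requests, smallest first.
      const double allowed_fp = target_fp - balance_;
      const std::size_t candidates[] = {target * 3 / 4, target * 7 / 8,
                                        target * 15 / 16};
      for (std::size_t candidate : candidates) {
        if (candidate > 0 &&
            EstimateFpRate(num_keys, candidate, num_probes_) <= allowed_fp) {
          request = candidate;
          break;
        }
      }
    }
    // Accept and use whatever usable size the request maps to.
    const std::size_t used = FakeUsableSize(request);
    const double fp = EstimateFpRate(num_keys, used, num_probes_);
    balance_ += fp - target_fp;  // deliberately not weighted by key count
    return {used, fp};
  }

  double balance() const { return balance_; }

 private:
  double bits_per_key_;
  int num_probes_;
  double balance_ = 0.0;  // sum of (actual FP - target FP) over built filters
};
```

With 10 bits/key and 10000 keys per filter, the first Build call rounds the 12500-byte target up to 16384 bytes; the second call sees a negative balance and lands in the 12288-byte class instead. In this sketch the balance never goes above zero, so the running average FP rate stays at or below the target without any knowledge of the allocator's size classes.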

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387
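
The "Reported internal fragmentation" figures above appear to be the allocated memory in excess of the total filter size, relative to the total filter size. The helper below is a sketch of that arithmetic inferred from the printed numbers, not taken from filter_bench itself; its name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Internal fragmentation as a fraction of the bytes the filters occupy:
// (allocated - used) / used. Inferred from the printed output, so treat the
// exact definition as an assumption.
inline double InternalFragmentation(double used_mb, double allocated_mb) {
  assert(used_mb > 0.0 && allocated_mb >= used_mb);
  return (allocated_mb - used_mb) / used_mb;
}
```

For the first pair of runs, InternalFragmentation(200.046, 220.597) gives about 10.27% and InternalFragmentation(200.015, 200.322) about 0.15%, in line with the reported values up to rounding of the printed sizes.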

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
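
The percentages quoted in the db_bench comments can be reproduced from the counters with the arithmetic sketched below. The reading of the counters (useful = true negatives, full.positive minus full.true.positive = false positives) is an assumption that happens to match the quoted figures; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Observed FP rate = false positives / (false positives + true negatives),
// assuming "useful" counts the queries the filter correctly rejected.
inline double ObservedFpRate(std::uint64_t useful, std::uint64_t full_positive,
                             std::uint64_t full_true_positive) {
  assert(full_positive >= full_true_positive);
  const double false_positives =
      static_cast<double>(full_positive - full_true_positive);
  return false_positives / (false_positives + static_cast<double>(useful));
}

// Relative change from `before` to `after`; negative means a reduction.
inline double RelativeChange(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before;
}
```

ObservedFpRate(4963889, 1214081, 1161999) is about 1.04% and ObservedFpRate(5360933, 1321315, 1262999) about 1.08%. RelativeChange(21087528, 18220328) is about -13.6% for filter bytes inserted into the block cache, and RelativeChange(17063835, 17430747) is about +2.1% for filter storage.
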
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e

2020-06-22 13:32:07 -07:00

.circleci

Remove CircleCI clang build's verbose output (#7000)

2020-06-19 17:11:55 -07:00

.github/workflows

Clean up some code related to file checksums (#6861)

2020-05-21 08:12:51 -07:00

buckifier

Directly use unit test template buck (#6926)

2020-06-05 12:16:33 -07:00

build_tools

build fixes for GNU/kFreeBSD (#6992)

2020-06-18 09:51:28 -07:00

cache

Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)

2020-06-03 15:55:03 -07:00

cmake

Add find_dependency() in cmake config file. (#6791)

2020-05-12 21:18:29 -07:00

coverage

Find the correct gcov (#6904)

2020-06-01 16:33:05 -07:00

db

Remove an assertion in FlushAfterIntraL0CompactionCheckConsistencyFail (#7003)

2020-06-19 16:58:29 -07:00

db_stress_tool

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

docs

Log warning for high bits/key in legacy Bloom filter (#6312)

2020-01-17 19:37:35 -08:00

env

Make EncryptEnv inheritable (#6830)

2020-06-22 13:27:16 -07:00

examples

add WITH_EXAMPLES options to cmake and cleanups. (#6580)

2020-06-18 18:00:04 -07:00

file

Fix block checksum for >=4GB, refactor (#6978)

2020-06-19 16:18:24 -07:00

hdfs

fix build with 'USE_HDFS' on windows (#6950)

2020-06-12 16:21:50 -07:00

include/rocksdb

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

java

Add logs and stats in DeleteScheduler (#6927)

2020-06-05 09:43:04 -07:00

logging

Fix info log source file display length (#5824)

2020-04-08 20:18:08 -07:00

memory

C++20 compatibility (#6697)

2020-04-20 13:24:25 -07:00

memtable

Fix more defects reported by Coverity Scan (#6935)

2020-06-04 15:35:08 -07:00

monitoring

Add logs and stats in DeleteScheduler (#6927)

2020-06-05 09:43:04 -07:00

options

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

port

build fixes for GNU/kFreeBSD (#6992)

2020-06-18 09:51:28 -07:00

table

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

test_util

Remove racially charged terms "whitelist" and "blacklist" (#7008)

2020-06-19 15:27:32 -07:00

third-party

Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)

2020-06-03 15:55:03 -07:00

tools

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

trace_replay

Fix double define in IO_tracer (#7007)

2020-06-22 10:20:13 -07:00

util

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

utilities

Fix persistent cache on windows (#6932)

2020-06-13 13:28:31 -07:00

.clang-format

A script that automatically reformats affected lines

2014-01-14 12:21:24 -08:00

.gitignore

Allow missing "unversioned" python, as in CentOS 8 (#6883)

2020-05-29 11:29:23 -07:00

.lgtm.yml

Create lgtm.yml for LGTM.com C/C++ analysis (#4058)

2018-06-26 12:43:04 -07:00

.travis.yml

Make sure core components not depend on gtest (#6921)

2020-06-03 18:22:14 -07:00

.watchmanconfig

Added .watchmanconfig file to rocksdb repo (#5593)

2019-07-19 15:00:33 -07:00

appveyor.yml

Reduce test coverage in older VS versions (#6966)

2020-06-12 17:05:47 -07:00

AUTHORS

Update RocksDB Authors File

2017-10-18 14:42:10 -07:00

CMakeLists.txt

add WITH_EXAMPLES options to cmake and cleanups. (#6580)

2020-06-18 18:00:04 -07:00

CODE_OF_CONDUCT.md

Adopt Contributor Covenant

2019-08-29 23:21:01 -07:00

CONTRIBUTING.md

Add Code of Conduct

2017-12-05 18:42:35 -08:00

COPYING

Add GPLv2 as an alternative license.

2017-04-27 18:06:12 -07:00

DEFAULT_OPTIONS_HISTORY.md

options.delayed_write_rate use the rate of rate_limiter by default.

2017-05-24 09:58:24 -07:00

defs.bzl

Make testpilot recognize that these tests have coverage instrumentation

2020-03-20 11:23:23 -07:00

DUMP_FORMAT.md

First version of rocksdb_dump and rocksdb_undump.

2015-06-19 16:24:36 -07:00

HISTORY.md

Minimize memory internal fragmentation for Bloom filters (#6427)

2020-06-22 13:32:07 -07:00

INSTALL.md

Update the version of the dependencies used by the RocksJava static build (#4761)

2018-12-18 20:25:43 -08:00

issue_template.md

Add Google Group to Issue Template

2020-01-28 14:40:37 -08:00

LANGUAGE-BINDINGS.md

LANGUAGE-BINDINGS.md: mention python-rocksdb

2019-03-20 11:10:48 -07:00

LICENSE.Apache

Change RocksDB License

2017-07-15 16:11:23 -07:00

LICENSE.leveldb

Add back the LevelDB license file

2017-07-16 18:42:18 -07:00

Makefile

Remove racially charged terms "whitelist" and "blacklist" (#7008)

2020-06-19 15:27:32 -07:00

README.md

Add Slack forum to README (#6773)

2020-04-30 11:00:28 -07:00

ROCKSDB_LITE.md

Fix some typos in comments and docs.

2018-03-08 10:27:25 -08:00

src.mk

Add IOTracer reader, writer classes for reading/writing IO operations in a binary file (#6958)

2020-06-18 10:46:11 -07:00

TARGETS

Add IOTracer reader, writer classes for reading/writing IO operations in a binary file (#6958)

2020-06-18 10:46:11 -07:00

thirdparty.inc

Fix build jemalloc api (#5470)

2019-06-24 17:40:32 -07:00

USERS.md

Add YugabyteDB to USERS (#6786)

2020-05-06 10:28:29 -07:00

Vagrantfile

Adding CentOS 7 Vagrantfile & build script

2018-02-26 15:27:17 -08:00

WINDOWS_PORT.md

#5145, rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152)

2019-04-04 11:38:19 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/ and https://rocksdb.slack.com/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.

Languages

C++ 82.1%

Java 10.3%

C 2.5%

Python 1.7%

Perl 1.1%

Other 2.1%