A library that provides an embeddable, persistent key-value store for fast storage.
Go to file
Peter Dillinger 9d0cae7104 Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899)
Summary:
When MultiGet() determines that multiple query keys can be
served by examining the same data block in block cache (one Lookup()),
each PinnableSlice referring to data in that data block needs to hold
on to the block in cache so that they can be released at arbitrary
times by the API user. Historically this is accomplished with extra
calls to Ref() on the Handle from Lookup(), with each PinnableSlice
cleanup calling Release() on the Handle, but this creates extra
contention on the block cache for the extra Ref()s and Release()es,
especially because they hit the same cache shard repeatedly.

In the case of merge operands (possibly more cases?), the problem was
compounded by doing an extra Ref()+eventual Release() for each merge
operand for a key reusing a block (which could be the same key!), rather
than one Ref() per key. (Note: the non-shared case with `biter` was
already one per key.)

This change optimizes MultiGet not to rely on these extra, contentious
Ref()+Release() calls by instead, in the shared block case, wrapping
the cache Release() cleanup in a refcounted object referenced by the
PinnableSlices, such that after the last wrapped reference is released,
the cache entry is Release()ed. Relaxed atomic refcounts should be
much faster than mutex-guarded Ref() and Release(), and much less prone
to a performance cliff when MultiGet() does a lot of block sharing.

Note that I did not use std::shared_ptr, because that would require an
extra indirection object (shared_ptr itself new/delete) in order to
associate a ref increment/decrement with a Cleanable cleanup entry. (If
I assumed it was the size of two pointers, I could do some hackery to
make it work without the extra indirection, but that's too fragile.)

Some details:
* Fixed (removed) extra block cache tracing entries in cases of cache
entry reuse in MultiGet, but it's likely that in some other cases traces
are missing (XXX comment inserted)
* Moved existing implementations for cleanable.h from iterator.cc to
new cleanable.cc
* Improved API comments on Cleanable
* Added a public SharedCleanablePtr class to cleanable.h in case others
could benefit from the same pattern (potentially many Cleanables and/or
smart pointers referencing a shared Cleanable)
* Add a typedef for MultiGetContext::Mask
* Some variable renaming for clarity

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899

Test Plan:
Added unit tests for SharedCleanablePtr.

Greatly enhanced ability of existing tests to detect cache use-after-free.
* Release PinnableSlices from MultiGet as they are read rather than in
bulk (in db_test_util wrapper).
* In ASAN build, default to using a trivially small LRUCache for block_cache
so that entries are immediately erased when unreferenced. (Updated two
tests that depend on caching.) New ASAN testsuite running time seems
OK to me.

If I introduce a bug into my implementation where we skip the shared
cleanups on block reuse, ASAN detects the bug in
`db_basic_test *MultiGet*`. If I remove either of the above testing
enhancements, the bug is not detected.

Consider for follow-up work: manipulate or randomize ordering of
PinnableSlice use and release from MultiGet db_test_util wrapper. But in
typical cases, natural ordering gives pretty good functional coverage.

Performance test:
In the extreme (but possible) case of MultiGetting the same or adjacent keys
in a batch, throughput can improve by an order of magnitude.
`./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
Before ops/sec, num=5: 1,384,394
Before ops/sec, num=500: 6,423,720
After ops/sec, num=500: 10,658,794
After ops/sec, num=5: 16,027,257

Also note that previously, with high parallelism, having query keys
concentrated in a single block was worse than spreading them out a bit. Now
concentrated in a single block is faster than spread out, which is hopefully
consistent with natural expectation.

Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
Possibly better, possibly in the noise.

Reviewed By: anand1976

Differential Revision: D35907003

Pulled By: pdillinger

fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
2022-04-26 21:59:24 -07:00
.circleci Misc CI improvements / additions (#9859) 2022-04-18 20:26:37 -07:00
.github/workflows Use released clang-format instead of the one from dev branch (#9646) 2022-03-01 10:51:38 -08:00
buckifier Remove flaky servicelab metrics DBPut P95/P99 (#9844) 2022-04-15 13:56:22 -07:00
build_tools Add "no compression" job to CircleCI (#9850) 2022-04-18 12:47:16 -07:00
cache Expose CacheEntryRole and map keys for block cache stat collections (#9838) 2022-04-14 09:38:55 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Fix commit_prereq and other targets (#9797) 2022-04-04 09:58:18 -07:00
db Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
db_stress_tool Add WAL compression to stress tests (#9811) 2022-04-06 15:47:09 -07:00
docs Bump nokogiri from 1.13.3 to 1.13.4 in /docs (#9831) 2022-04-12 09:07:14 -07:00
env Add experimental new FS API AbortIO to cancel read request (#9901) 2022-04-25 14:20:03 -07:00
examples Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) 2022-03-01 13:58:02 -08:00
file Add stats related to async prefetching (#9845) 2022-04-25 21:58:22 -07:00
fuzz Fix compilation errors and add fuzzers to CircleCI (#9420) 2022-02-01 10:32:15 -08:00
include/rocksdb Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
java Revert open logic changes in #9634 (#9906) 2022-04-26 14:46:53 -07:00
logging Use system-wide thread ID in info log lines (#9164) 2021-11-12 19:46:06 -08:00
memory Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
memtable Account memory of big memory users in BlockBasedTable in global memory limit (#9748) 2022-04-06 10:33:00 -07:00
microbench Add mmap DBGet microbench parameters (#9903) 2022-04-26 16:46:39 -07:00
monitoring Revert open logic changes in #9634 (#9906) 2022-04-26 14:46:53 -07:00
options Misc CI improvements / additions (#9859) 2022-04-18 20:26:37 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
table Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
test_util Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) 2022-03-01 13:58:02 -08:00
third-party Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
tools Add 95% confidence intervals to db_bench output (#9882) 2022-04-25 14:49:54 -07:00
trace_replay Added TraceOptions::preserve_write_order (#9334) 2021-12-28 15:04:26 -08:00
util Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
utilities Add rollback_deletion_type_callback to TxnDBOptions (#9873) 2022-04-20 18:57:32 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
crash_test.mk Add new determinators for multiops transactions stress test (#9708) 2022-03-23 22:29:50 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
INSTALL.md Update supported VS versions in INSTALL.md (#9823) 2022-04-13 13:03:40 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
LANGUAGE-BINDINGS.md Update branch name to "main" in README/LANGUAGE_BINDINGS (#8727) 2021-09-01 15:26:34 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Misc CI improvements / additions (#9859) 2022-04-18 20:26:37 -07:00
PLUGINS.md Move RADOS support to separate repo (#9206) 2022-01-24 22:50:07 -08:00
python.mk crash_test Makefile refactoring, add to CircleCI (#9702) 2022-03-16 15:58:06 -07:00
README.md README: De-list slack channel, list Google group (#9387) 2022-01-18 08:19:48 -08:00
ROCKSDB_LITE.md Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
src.mk Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
TARGETS Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00
USERS.md Add Solana's RocksDB use case in USERS.md (#9558) 2022-02-16 09:23:01 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status TravisCI Status Appveyor Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.