Summary:
* Clarify that RocksDB is not exception safe on many of our callback
and extension interfaces
* Clarify FSRandomAccessFile::MultiRead implementations must accept
non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953)
* Clarify ConcurrentTaskLimiter and SstFileManager are not (currently)
extensible interfaces
* Mark WriteBufferManager as `final`, so it is then clearly not a
callback interface, even though it smells like one
* Clarify TablePropertiesCollector Status returns are mostly ignored
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080
Test Plan: comments only (except WriteBufferManager final)
Reviewed By: ajkr
Differential Revision: D31968782
Pulled By: pdillinger
fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2
Summary:
EndWriteStall has a data race: `queue_.empty()` is checked outside of the
mutex, so once we enter the critical section another thread may already have
cleared the list, and accessing the `front()` is undefined behavior (and causes
interesting crashes under high concurrency).
This PR fixes the bug, and also rewrites the logic to make it easier to reason
about it. It also fixes another subtle bug: if some writers are stalled and
`SetBufferSize(0)` is called, which disables the WBM, the writer are not
unblocked because of an early `enabled()` check in `EndWriteStall()`.
It doesn't significantly change the locking behavior, as before writers won't
lock unless entering a stall condition, and `FreeMem` almost always locks if
stalling is allowed, but that is inevitable with the current design. Liveness is
guaranteed by the fact that if some writes are blocked, eventually all writes
will be blocked due to `stall_active_`, and eventually all memory is freed.
While at it, do a couple of optimizations:
- In `WBMStallInterface::Signal()` signal the CV only after releasing the
lock. Signaling under the lock is a common pitfall, as it causes the woken-up
thread to immediately go back to sleep because the mutex is still locked by
the awaker.
- Move all allocations and deallocations outside of the lock.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009
Test Plan:
```
USE_CLANG=1 make -j64 all check
```
Reviewed By: akankshamahajan15
Differential Revision: D31550668
Pulled By: ot
fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580
Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for inserting/removing dummy entries to reserve cache space for memory used by the class user.
- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428
- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838
- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.
- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm
- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.
Previously, the algorithms only release 1 dummy entries when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithms loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.
- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity
- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.
The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.
This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506
Test Plan:
- Passing the added unit tests cache_reservation_manager_test.cc
- Passing existing and adjusted write_buffer_manager_test.cc
Reviewed By: ajkr
Differential Revision: D29644135
Pulled By: hx235
fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.
Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.
DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.
If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.
End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.
WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).
DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898
Test Plan: Added a new test db/db_write_buffer_manager_test.cc
Reviewed By: anand1976
Differential Revision: D26093227
Pulled By: akankshamahajan15
fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.
The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.
This change is useful as is. However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.
Any format / spacing changes are due to clang-format as required by check-in automation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961
Reviewed By: ajkr
Differential Revision: D26639075
Pulled By: akankshamahajan15
fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837
Test Plan: Updated test ./write_buffer_manager_test
Reviewed By: ajkr
Differential Revision: D25794312
Pulled By: akankshamahajan15
fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695
Differential Revision: D13112722
Pulled By: siying
fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
Summary:
Even if hard limit hits, flushing more memtable may not help cap the memory usage if already more than half data is scheduled for flush. Not triggering flush instead.
Closes https://github.com/facebook/rocksdb/pull/2469
Differential Revision: D5284249
Pulled By: siying
fbshipit-source-id: 8ab7ba1aba56a634dbe72b318fcab2093063972e
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary: If options.write_buffer_size is not set, nor options.write_buffer_manager, no need to update the bytes allocated counter in MemTableAllocator, which is expensive in parallel memtable insert case. Remove it can improve parallel memtable insert throughput by 10% with write batch size 128.
Test Plan:
Run benchmarks
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield
The throughput grows 10% with the benchmark.
Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson
Reviewed By: ngbronson
Subscribers: ngbronson, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60465
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925