Summary:
When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if
there is a dying thread, which is calling
ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to
HistogramInfo, deadlock will occur. Because the former try to hold
merge_lock then ThreadMeta::mutex_, but the later try to hold
ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't
the same.
This patch addressed this issue by releasing merge_lock before folding
thread values.
Closes https://github.com/facebook/rocksdb/pull/1552
Differential Revision: D4211942
Pulled By: ajkr
fbshipit-source-id: ef89bcb
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow
These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577
Differential Revision: D4240801
Pulled By: IslamAbdelRahman
fbshipit-source-id: 3e1caf6
Summary:
Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number
Closes https://github.com/facebook/rocksdb/pull/1579
Differential Revision: D4240798
Pulled By: IslamAbdelRahman
fbshipit-source-id: bd6267e
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary:
In the current implementation of RateLimiter, the difference
between the configured rate and the actual rate might be more
than 20%, while our test only allows 15% difference. This diff
relaxes the acceptable bias RateLimiterTest::Rate test be 25%
to make the test less flaky.
Test Plan: rate_limiter_test
Reviewers: IslamAbdelRahman, andrewkr, yiwu, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64941
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).
Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64869
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode
Test Plan:
make check -j64
make release
Reviewers: lightmark, sdong, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64701
Summary:
I saw this exception thrown because sometimes we may resize with -ve value
if we have empty max_bytes_for_level_multiplier_additional vector
Test Plan: run the tests
Reviewers: yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64791
Hello and thank you for RocksDB,
I noticed when using log_write_bench that writes were always 88 bytes:
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
I think this should be:
<< record.assign('X', FLAGS_record_size);
>> record.assign(FLAGS_record_size, 'X');
So fill and not buffer. Otherwise I always see writes of size 88 (the decimal value for chr "X").
string& assign (const char* s, size_t n);
buffer - Copies the first n characters from the array of characters pointed by s.
string& assign (size_t n, char c);
fill - Replaces the current value by n consecutive copies of character c.
perl -le 'print ord "X"'
88
With the change:
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
Thanks.
01c27be5fbhttps://reviews.facebook.net/D16239
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments
Test Plan: Build MyRocks
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64161
Summary: 0.9 can make the test flaky since just found one test fail with 0.88
Test Plan: make all check
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63939
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.
Test Plan: Run tests with high parallelism many times and make sure the test passes.
Reviewers: yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63417
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.
Test Plan: unit test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63843
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: Fix two Windows build problems.
Test Plan: Build on Windows and run all Linux tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63189
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,
- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones
Test Plan:
unit tests, ran db_bench and verified histograms look similar
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 7.63% db_bench db_bench [.] rocksdb::HistogramStat::Add
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 0.98% db_bench db_bench [.] rocksdb::HistogramStat::Add
Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62649
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.
Test Plan:
make DISABLE_JEMALLOC=1 valgrind_check
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62973
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,
- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.
In a next diff I will make a similar change for histogram stats.
Test Plan:
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
7.59% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
1.46% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62337
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.
Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014
db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.
The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62325
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).
Test Plan: unit test
Reviewers: yhchiang, sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62049
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
* Create rate limiter using factory function in the test.
* Convert function local statics in option helper to a C array
that does not perform dynamic memory allocation. This is helpful
when you try to memory isolate different DB instances.