Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595
Differential Revision: D4246569
Pulled By: tonek
fbshipit-source-id: d23f560
Summary:
Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child)
So next time we can compare directly against the smallest child
I see that the total number of calls to comparator drops significantly when using this optimization
Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.338 micros/op 2959201 ops/sec; 327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.309 micros/op 3236801 ops/sec; 358.1 MB/s user_key_comparison_count = 1000011
```
It also improves
Closes https://github.com/facebook/rocksdb/pull/1600
Differential Revision: D4256027
Pulled By: IslamAbdelRahman
fbshipit-source-id: 76fcc66
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow
These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577
Differential Revision: D4240801
Pulled By: IslamAbdelRahman
fbshipit-source-id: 3e1caf6
Summary:
Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number
Closes https://github.com/facebook/rocksdb/pull/1579
Differential Revision: D4240798
Pulled By: IslamAbdelRahman
fbshipit-source-id: bd6267e
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:
```
rocksdb::MergingIterator::Next
rocksdb::DBIter::FindNextUserEntryInternal
rocksdb::DBIter::Seek
```
I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.
CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413
Differential Revision: D4083879
Pulled By: IslamAbdelRahman
fbshipit-source-id: a83ddae
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.
Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562
Differential Revision: D4218519
Pulled By: siying
fbshipit-source-id: 95e4088
Summary:
LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`. Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two functions calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408).
Closes https://github.com/facebook/rocksdb/pull/1532
Differential Revision: D4199240
Pulled By: siying
fbshipit-source-id: 138c2bc
Summary:
We should close the fd, before overriding it. This bug was
introduced by f89caa127b
Closes https://github.com/facebook/rocksdb/pull/1553
Differential Revision: D4214101
Pulled By: siying
fbshipit-source-id: 0d65de0
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548
Differential Revision: D4208169
Pulled By: ajkr
fbshipit-source-id: 2fd65cf
Summary:
When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if
there is a dying thread, which is calling
ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to
HistogramInfo, deadlock will occur. Because the former try to hold
merge_lock then ThreadMeta::mutex_, but the later try to hold
ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't
the same.
This patch addressed this issue by releasing merge_lock before folding
thread values.
Closes https://github.com/facebook/rocksdb/pull/1552
Differential Revision: D4211942
Pulled By: ajkr
fbshipit-source-id: ef89bcb
Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.
Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545
Differential Revision: D4205662
Pulled By: lth
fbshipit-source-id: ff990e4
Summary:
Currently, in the Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not multiplier of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With the commit, if the return value is not multiplier of kSectorSize, and all but the last sector has been read, we simply return.
Closes https://github.com/facebook/rocksdb/pull/1550
Differential Revision: D4209609
Pulled By: lightmark
fbshipit-source-id: cb0b439
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514
Differential Revision: D4204907
Pulled By: maysamyabandeh
fbshipit-source-id: 06eabd2
Summary:
It is hard to measure acutal memory usage by std containers. Even
providing a custom allocator will miss count some of the usage. Here we
only do a wild guess on its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511
Differential Revision: D4179945
Pulled By: yiwu-arbug
fbshipit-source-id: 32ab929
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.
This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
implementation to return a hint of insert position, which can be later
pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419
Differential Revision: D4079367
Pulled By: yiwu-arbug
fbshipit-source-id: 3555326
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.
This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.
End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484
Differential Revision: D4155248
Pulled By: ajkr
fbshipit-source-id: 304b4b9
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439
Differential Revision: D4100943
Pulled By: yiwu-arbug
fbshipit-source-id: 79f7218
Summary:
Rocksdb currently has many references to std::map.emplace_back()
which is not implemented in gcc 4.7, but valid in gcc 4.8. Confirmed that
it did not build with gcc 4.7, but builds fine with gcc 4.8
Closes https://github.com/facebook/rocksdb/pull/1272
Differential Revision: D4101385
Pulled By: IslamAbdelRahman
fbshipit-source-id: f6af453
* util/build_verion.cc.in: add this file, so cmake and make can share the
template file for generating util/build_version.cc.
* CMakeLists.txt: also, cmake v2.8.11 does not support file(GENERATE ...),
so we are using configure_file() for creating build_version.cc.
* Makefile: use util/build_verion.cc.in for creating build_version.cc.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.
Test Plan: Run the tests and all existing tests.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65145
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.
Test Plan: make all check -j64
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65121
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary:
In the current implementation of RateLimiter, the difference
between the configured rate and the actual rate might be more
than 20%, while our test only allows 15% difference. This diff
relaxes the acceptable bias RateLimiterTest::Rate test be 25%
to make the test less flaky.
Test Plan: rate_limiter_test
Reviewers: IslamAbdelRahman, andrewkr, yiwu, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64941
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).
Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64869
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode
Test Plan:
make check -j64
make release
Reviewers: lightmark, sdong, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64701
Summary:
I saw this exception thrown because sometimes we may resize with -ve value
if we have empty max_bytes_for_level_multiplier_additional vector
Test Plan: run the tests
Reviewers: yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64791
Hello and thank you for RocksDB,
I noticed when using log_write_bench that writes were always 88 bytes:
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
I think this should be:
<< record.assign('X', FLAGS_record_size);
>> record.assign(FLAGS_record_size, 'X');
So fill and not buffer. Otherwise I always see writes of size 88 (the decimal value for chr "X").
string& assign (const char* s, size_t n);
buffer - Copies the first n characters from the array of characters pointed by s.
string& assign (size_t n, char c);
fill - Replaces the current value by n consecutive copies of character c.
perl -le 'print ord "X"'
88
With the change:
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
Thanks.
01c27be5fbhttps://reviews.facebook.net/D16239
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments
Test Plan: Build MyRocks
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64161
Summary: 0.9 can make the test flaky since just found one test fail with 0.88
Test Plan: make all check
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63939
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.
Test Plan: Run tests with high parallelism many times and make sure the test passes.
Reviewers: yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63417
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.
Test Plan: unit test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63843
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: Fix two Windows build problems.
Test Plan: Build on Windows and run all Linux tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63189
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,
- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones
Test Plan:
unit tests, ran db_bench and verified histograms look similar
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 7.63% db_bench db_bench [.] rocksdb::HistogramStat::Add
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 0.98% db_bench db_bench [.] rocksdb::HistogramStat::Add
Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62649
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.
Test Plan:
make DISABLE_JEMALLOC=1 valgrind_check
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62973
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,
- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.
In a next diff I will make a similar change for histogram stats.
Test Plan:
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
7.59% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
1.46% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62337
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.
Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014
db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.
The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62325
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).
Test Plan: unit test
Reviewers: yhchiang, sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62049
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
* Create rate limiter using factory function in the test.
* Convert function local statics in option helper to a C array
that does not perform dynamic memory allocation. This is helpful
when you try to memory isolate different DB instances.
Summary: ... so that I can include the header and create LRUCache specific tests for D61977
Test Plan:
make check
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62145
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
We were frequently seeing a race between SyncPoint::Process() and
SyncPoint::~SyncPoint() (e.g.,
https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=207289975&step_id=2412725431).
The issue was marked_thread_id_ gets deleted when the main thread is exiting and
simultaneously background threads may access it. We can prevent this race
condition by checking whether sync points are disabled (assuming the test terminates
with them disabled) before attempting to access that member. I do not understand
why accesses to other members (mutex_ and enabled_) are ok but anyways the
test no longer fails tsan.
Test Plan: ran tests
Reviewers: sdong, yhchiang, IslamAbdelRahman, yiwu, wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62133
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted. However, in case a rocksdb database is deleted after
Env is deleted. Then this will introduce a free-after-use of this
ThreadStatusUpdater.
This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.
Test Plan: thread_list_test
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59187
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()
Test Plan: make all check
Reviewers: yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61953
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.
Test Plan: options_test
Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61731
From the Linux manual:
MAP_ANONYMOUS
The mapping is not backed by any file; its contents
are initialized to zero. The fd and offset arguments are
ignored; however, some implementations require fd to be -1
if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
applications should ensure this.
FreeBSD is such a case, it wil just return an error.
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.
Test Plan: make all check -j64
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61239
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.
Test Plan: Run unit tests
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57549
Summary:
The call stack used to look like this during static initialization:
#0 0x00000000008032d1 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:172
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080310f in rocksdb::ThreadLocalPtr::StaticMeta::Mutex() () at util/thread_local.cc:141
#3 0x0000000000803103 in rocksdb::ThreadLocalPtr::StaticMeta::InitSingletons() () at util/thread_local.cc:139
#4 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
It involves outer/inner classes and the call stacks goes
outer->inner->outer->inner, which is too difficult to understand. We can avoid
a level of back-and-forth by skipping StaticMeta::InitSingletons(), which
doesn't initialize anything beyond what ThreadLocalPtr::Instance() already
initializes.
Now the call stack looks like this during static initialization:
#0 0x00000000008032c5 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:170
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
Test Plan:
unit tests
verify StaticMeta::mutex_ is still initialized in DefaultEnv() (StaticMeta::mutex_ is the only variable intended to be initialized via StaticMeta::InitSingletons() which I removed)
#0 0x00000000005cee17 in rocksdb::port::Mutex::Mutex(bool) (this=0x7ffff69500b0, adaptive=false) at port/port_posix.cc:52
#1 0x0000000000769cf8 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff6950000) at util/thread_local.cc:168
#2 0x0000000000769a53 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:133
#3 0x0000000000769a09 in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:105
#4 0x0000000000647d98 in rocksdb::Env::Default() () at util/env_posix.cc:845
Reviewers: lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: arahut, IslamAbdelRahman, yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60813
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.
Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.
We have specifically tested and adapted the IO path for NVM and SSD.
Persistent cache tier consists of there parts :
1) File layout
Provides the implementation for handling IO path for reading and writing data
(key/value pair).
2) Meta-data
Provides the implementation for handling the index for persistent read cache.
3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface
This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.
Test Plan: Compile and run check
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57117
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output). Also there were unnecessary loops around the deflate/BZ2_bzCompress calls. In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.
Test Plan: Standard tests
Reviewers: sdong, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60537
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:
- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append().
Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9.
Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59607
Summary: this test support CLI option to select HdfsEnv/NativeHdfsEnv now. The latter one is default. add test about when Rename(src, target) should overwrite target
Test Plan: existing test
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60399
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.
Depends on D60375.
Test Plan:
$ ./auto_roll_logger_test
Reviewers: omegaga, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, omegaga, leveldb
Differential Revision: https://reviews.facebook.net/D60471
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.
Test Plan: Write a unit test to validate.
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60375