Commit Graph

250 Commits

Author SHA1 Message Date
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Maysam Yabandeh
eab4f49a2c WritePrepared: skip_concurrency_control option (#5330)
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330

Differential Revision: D15525932

Pulled By: maysamyabandeh

fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
2019-05-28 16:29:45 -07:00
Zhichao Cao
a13026fb2f Added trace replay fast forward function (#5273)
Summary:
In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273

Differential Revision: D15389232

Pulled By: zhichao-cao

fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806
2019-05-16 20:21:18 -07:00
Maysam Yabandeh
f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
Yi Wu
92c60547fe db_bench: fix hang on IO error (#5300)
Summary:
db_bench will wait indefinitely if there's background error. Fix by pass `abs_time_us` to cond var.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300

Differential Revision: D15319945

Pulled By: miasantreble

fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44
2019-05-13 11:30:35 -07:00
Sagar Vemuri
3548e4220d Improve explicit user readahead performance (#5246)
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.

1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.

**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737

For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.

**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.

TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246

Differential Revision: D15107946

Pulled By: sagar0

fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
2019-04-26 21:24:10 -07:00
Adam Retter
990b2f4cb3 Fix compilation on db_bench_tool.cc on Windows (#5227)
Summary:
I needed this change to be able to build the v6.0.1 release on Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5227

Differential Revision: D15033815

Pulled By: sagar0

fbshipit-source-id: 579f3b8e694c34c0d43527eb2fa37175e37f5911
2019-04-23 11:16:51 -07:00
Zhongyi Xie
baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Yi Wu
b70967aac7 db_bench: support seek to non-exist prefix (#5163)
Summary:
Add `--seek_missing_prefix` flag to db_bench to allow benchmarking seeking to non-existing prefix. Usage example:
```
./db_bench --db=/dev/shm/db_bench --use_existing_db=false --benchmarks=fillrandom --num=100000000 --prefix_size=9 --keys_per_prefix=10
./db_bench --db=/dev/shm/db_bench --use_existing_db=true --benchmarks=seekrandom --disable_auto_compactions=true --num=100000000 --prefix_size=9 --keys_per_prefix=10 --reads=1000 --prefix_same_as_start=true --seek_missing_prefix=true
```
Also adding `--total_order_seek` and `--prefix_same_as_start` flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5163

Differential Revision: D14935724

Pulled By: riversand963

fbshipit-source-id: 7c41023f007febe373eb1589861f215432a9e18a
2019-04-15 10:54:58 -07:00
Yanqin Jin
3189398c00 Fix bugs detected by clang analyzer (#5185)
Summary:
as titled. False positive included, fixed anyway to make the check
pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185

Differential Revision: D14909384

Pulled By: riversand963

fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
2019-04-12 10:45:56 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
Zhichao Cao
05ebfebc17 Fixed the potential stack overflow of MixGraph in db_bench (#5051)
Summary:
In the MixGraph benchmark of db_bench, The max buffer size used for value of KV-pair might be extremely large (64MB), which might cause function stack overflow in some platforms, reduced to 1MB.

Added the finished ops printing in MixGraph benchmark.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5051

Differential Revision: D14379571

Pulled By: zhichao-cao

fbshipit-source-id: 24084fbe38f60f2902d9a40f6bc9a25e4e2c9bb9
2019-03-08 14:10:17 -08:00
Andrew Kryczka
18d2e4beb7 Run db_bench on database generated externally (#5017)
Summary:
Added an option, `-use_existing_keys`, which can be set to run
benchmarks against an arbitrary existing database. Now users can
benchmark against their actual database rather than synthetic data.

Before the run begins, it loads all the keys into memory, then uses that
set of keys rather than synthesizing new ones in `GenerateKeyFromInt`.
This is mainly intended for small-scale DBs where the memory consumption
is not a concern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5017

Differential Revision: D14270303

Pulled By: riversand963

fbshipit-source-id: 6328df9dffb5e19170270dd00a69f4bbe424e5ed
2019-03-01 11:19:03 -08:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Siying Dong
4db46aa2e6 Fix LITE Build (#4989)
Summary:
LITE mode has EventListener to be an empty class. However in db_bench,
it is used. When "override" is added to the functions, the build breaks. Fix it
by keeping the listener empty in LITE mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4989

Differential Revision: D14108132

Pulled By: siying

fbshipit-source-id: 80121aab35b1120e502b37b782301dd700692697
2019-02-15 16:13:11 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Siying Dong
d9c9f3c809 db_bench: fix "micros/op" reporting (#4949)
Summary:
4985a9f73b (diff-e5276985b26a0551957144f4420a594bR511)
changes the meaning of latency reporting from running time per query, to elapse_time / #ops, without providing a reason why.
Considering that this is a counter-intuitive reporting, Reverting the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4949

Differential Revision: D13964684

Pulled By: siying

fbshipit-source-id: d6304d3d4b5a802daa292302623c7dbca9a680bc
2019-02-05 17:20:02 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
zhichao-cao
e2547103fd Fix the build error caused by the dynamic array (#4918)
Summary:
In the MixGraph benchmark of db_bench #4788 , the char array is initialized with an argument from user's input, which can cause build error on some platforms. Also, the msg char array size can be potentially smaller than the printed data, which should be extended from 100 to 256.

Tested with make check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4918

Differential Revision: D13844298

Pulled By: sagar0

fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab
2019-01-28 12:39:57 -08:00
Andrew Kryczka
e242fa4664 Add latest toolchain (gcc-8, etc.) build support for fbcode users (#4923)
Summary:
- When building with internal dependencies, specify this toolchain by setting `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1`
- It is not enabled by default. However, it is enabled for TSAN builds in CI since there is a known problem with TSAN in gcc-5: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71090
- I did not add support for Lua since (1) we agreed to deprecate it, and (2) we only have an internal build for v5.3 with this toolchain while that has breaking changes compared to our current version (v5.2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4923

Differential Revision: D13827226

Pulled By: ajkr

fbshipit-source-id: 9aa3388ed3679777cfb15ef8cbcb83c07f62f947
2019-01-28 11:26:32 -08:00
Sagar Vemuri
0cead31d10 Fix Clang static analyzer warning in db_bench (#4910)
Summary:
Fixed clang static analyzer warning about division by 0.
```
ar: creating librocksdb_debug.a
tools/db_bench_tool.cc:4650:43: warning: Division by zero
      int pos = static_cast<int>(rand_num % range_);
                                 ~~~~~~~~~^~~~~~~~
1 warning generated.
make: *** [analyze] Error 1
```

This is from the new code I recently merged in ce8e88d2d7.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4910

Differential Revision: D13788037

Pulled By: sagar0

fbshipit-source-id: f48851dca85047c19fbb1a361e25ce643aa4c7ea
2019-01-23 13:33:02 -08:00
Zhongyi Xie
cbe0239270 add cast to avoid loss of precision error (#4906)
Summary:
this PR address the following error:
> tools/db_bench_tool.cc:4776:68: error: implicit conversion loses integer precision: 'int64_t' (aka 'long') to 'unsigned int' [-Werror,-Wshorten-64-to-32]
        s = db_with_cfh->db->Put(write_options_, key, gen.Generate(value_size));
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4906

Differential Revision: D13780185

Pulled By: miasantreble

fbshipit-source-id: 1c83a77d341099518c72f0f4a63e97ab9c4784b3
2019-01-22 22:44:17 -08:00
Zhichao Cao
ce8e88d2d7 Generate mixed workload with Get, Put, Seek in db_bench (#4788)
Summary:
Based on the specific workload models (key access distribution, value size distribution, and iterator scan length distribution, the QPS variation), the MixGraph benchmark generate the synthetic workload according to these distributions which can reflect the real-world workload characteristics.

After user enable the tracing function, they will get the trace file. By analyzing the trace file with the trace_analyzer tool, user can generate a set of statistic data files including. The *_accessed_key_stats.txt,  *-accessed_value_size_distribution.txt, *-iterator_length_distribution.txt, and *-qps_stats.txt are mainly used to fit the Matlab model fitting. After that, user can get the parameters of the workload distributions (the modeling details are described: [here](https://github.com/facebook/rocksdb/wiki/RocksDB-Trace%2C-Replay%2C-and-Analyzer))

The key access distribution follows the The two-term power model. The probability density function is: `f(x) = ax^{b}+c`. The corresponding parameters are key_dist_a, key_dist_b, and key_dist_c in db_bench

For the value size distribution and iterator scan length distribution, they both follow the Generalized Pareto Distribution. The probability density function is `f(x) = (1/sigma)(1+k*(x-theta)/sigma))^{-1-1/k)`. The parameters are: value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, users can find the [wiki](https://en.wikipedia.org/wiki/Generalized_Pareto_distribution) and [Matalb page](https://www.mathworks.com/help/stats/generalized-pareto-distribution.html)

As for the QPS, it follows the diurnal pattern. So Sine is a good model to fit it. `F(x) = sine_a*sin(sine_b*x + sine_c) + sine_d`. The trace_will tell you the average QPS in the print out resutls, which is sine_d. After user fit the "*-qps_stats.txt" to the Matlab model, user can get the sine_a, sine_b, and sine_c. By using the 4 parameters, user can control the QPS variation including the period, average, changes.

To use the bench mark, user can indicate the following parameters as examples:
```
-benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4788

Differential Revision: D13573940

Pulled By: sagar0

fbshipit-source-id: e184c27e07b4f1bc0b436c2be36c5090c1fb0222
2019-01-22 10:44:26 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Sagar Vemuri
70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Sagar Vemuri
c94f073e5e Fix Mac build break in casting (#4722)
Summary:
Mac build is failing with the below error:
```
$ make db_bench -j8
...
...
tools/db_bench_tool.cc:4583:25: error: no matching function for call to 'max'
              (uint64_t)std::max(0l, seek_pos - FLAGS_max_scan_distance),
                        ^~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2717:1: note: candidate template ignored: deduced conflicting types for parameter '_Tp' ('long' vs. 'long long')
max(const _Tp& __a, const _Tp& __b)
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2727:1: note: candidate template ignored: could not match 'initializer_list<type-parameter-0-0>' against 'long'
max(initializer_list<_Tp> __t, _Compare __comp)
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2709:1: note: candidate function template not viable: requires 3 arguments, but 2 were provided
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2735:1: note: candidate function template not viable: requires single argument '__t', but 2 arguments were provided
max(initializer_list<_Tp> __t)
^
1 error generated.
make: *** [tools/db_bench_tool.o] Error 1
```

My compiler version:
Mac OS X Mojave
```
$ clang++ --version
Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4722

Differential Revision: D13220196

Pulled By: sagar0

fbshipit-source-id: 01e5e928288a5613027c83a26ad8aedf04438b14
2018-11-27 13:30:16 -08:00
Po-Chuan Hsieh
60deb4485e Fix build with ROCKSDB_LITE and -Wunused-private-field (#4715)
Summary:
The error message of databases/rocksdb-lite (FreeBSD port) is as follows:
```
  tools/db_bench_tool.cc:1976:16: error: private field 'trace_options_' is not used [-Werror,-Wunused-private-field]
    TraceOptions trace_options_;
                 ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4715

Differential Revision: D13207902

Pulled By: ajkr

fbshipit-source-id: be3c612eba656aeddb77e35e2f201dd25dc92f7e
2018-11-26 21:35:38 -08:00
Abhishek Madan
0ed738fdd0 Add max_scan_distance flag to db_bench (#4660)
Summary:
The new flag makes it possible to constrain iterator traversal
by the upper/lower bound the iterator is expected to pass. This allows
seekrandom results to be more easily comparable between DBs with and
without deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4660

Differential Revision: D13053111

Pulled By: abhimadan

fbshipit-source-id: 33e250f2e2d210b54c7726399da30a33f723c33c
2018-11-14 10:46:12 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Zhongyi Xie
fe0d23059d Fix two contrun job failures (#4587)
Summary:
Currently there are two contrun test failures:
* rocksdb-contrun-lite:
> tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’:
tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’
     rocksdb::DumpMallocStats(&stats_string);
     ^
make: *** [tools/db_bench_tool.o] Error 1
* rocksdb-contrun-unity:
> In file included from unity.cc:44:0:
db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’:
db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous
   auto cmp = ParsedInternalKeyComparator(icmp_);

This PR will fix them
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587

Differential Revision: D10846554

Pulled By: miasantreble

fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93
2018-10-24 20:16:45 -07:00
Yi Wu
0415244bfa option to print malloc stats at the end of db_bench (#4582)
Summary:
Option to print malloc stats to stdout at the end of db_bench. This is different from `--dump_malloc_stats`, which periodically print the same information to LOG file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4582

Differential Revision: D10520814

Pulled By: yiwu-arbug

fbshipit-source-id: beff5e514e414079d31092b630813f82939ffe5c
2018-10-24 11:39:05 -07:00
Simon Grätzer
f959e88048 Fix printf formatting on MacOS (#4533)
Summary:
On MacOS with clang the compilation of _tools/db_bench_tool.cc_ always fails because the format used in a `fprintf` call has the wrong type. This PR should hopefully fix this issue
```
tools/db_bench_tool.cc:4233:61: error: format specifies type 'unsigned long long' but the argument has type 'size_t' (aka 'unsigned long')
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4533

Differential Revision: D10471657

Pulled By: maysamyabandeh

fbshipit-source-id: f20f5f3756d3571b586c895c845d0d4d1e34a398
2018-10-19 14:46:09 -07:00
Abhishek Madan
35cd754a6d Add writes_before_delete_range flag to db_bench (#4538)
Summary:
The new flag allows tombstones to be generated after enough
keys have been written to the database, which makes it easier to ensure
that tombstones cover a lot of keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4538

Differential Revision: D10455685

Pulled By: abhimadan

fbshipit-source-id: f25d5421745a353c830dea12b79784e852056551
2018-10-18 17:19:59 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
Igor Canadi
1cf5deb8fd Introduce CacheAllocator, a custom allocator for cache blocks (#4437)
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.

We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.

Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437

Differential Revision: D10132814

Pulled By: yiwu-arbug

fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
2018-10-02 17:24:58 -07:00
Abhishek Madan
519f8b145f Generate appropriate number of keys in db_bench (#4404)
Summary:
If range tombstones are generated every few writes, the
KeyGenerator's limit is now extended to account for the additional
Next() calls. This is primarily important for `filluniquerandom`
benchmarks that enforce the call limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4404

Differential Revision: D9949326

Pulled By: abhimadan

fbshipit-source-id: 0bdfeb2cad2098dc0b8b029236dab5e4bef25e38
2018-09-19 16:28:21 -07:00
Anand Ananthabhotla
a27fce408e Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
  a) On the first occurance of an out of space error during compaction,
subsequent
  compactions will be delayed until the disk free space check indicates
  enough available space. The required space is computed as the sum of
  input sizes.
  b) The free space check requirement will be removed once the amount of
  free space is greater than the size reserved by in progress
  compactions when the first error occured
  c) If the out of space error is a hard error, a background thread in
  SFM will poll for sufficient headroom before triggering the recovery
  of the database and putting it in write-only mode. The headroom is
  calculated as the sum of the write_buffer_size of all the DB instances
  associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()

Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164

Differential Revision: D9846378

Pulled By: anand1976

fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 13:43:04 -07:00
Andrew Kryczka
2c14662213 Revert "Digest ZSTD compression dictionary once per SST file (#4251)" (#4347)
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.

Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347

Differential Revision: D9668365

Pulled By: ajkr

fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
2018-09-06 09:58:34 -07:00
Andrew Kryczka
6c40806e51 Digest ZSTD compression dictionary once per SST file (#4251)
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.

ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.

There are a couple other changes included in this PR:

- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251

Differential Revision: D9257078

Pulled By: ajkr

fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
2018-08-23 19:28:18 -07:00
Fenggang Wu
9d646a6311 Add db_bench options of data block hash index (#4281)
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.

`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281

Differential Revision: D9361476

Pulled By: fgwu

fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
2018-08-16 18:42:46 -07:00
Andrew Kryczka
7a9a164276 Fix db_bench default compression level (#4248)
Summary:
db_bench's previous default compression level (-1) was not the default compression level in all libraries. In particular, in ZSTD negative values are valid compression levels, while ZSTD's default compression level is three.

This PR changes db_bench's default to be RocksDB's library-independent default compression level (see #3895). I also changed a couple other flags to get their default values from an options object directly rather than hardcoding.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4248

Differential Revision: D9235140

Pulled By: ajkr

fbshipit-source-id: be4e0722d59fa1968832183db36d1d20fcf11e5b
2018-08-09 10:28:14 -07:00
Sagar Vemuri
fefdac1004 Fix lite build failure in db_bench due to trace/replay (#4225)
Summary:
Fix lite build failure in db_bench due to trace/replay feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4225

Differential Revision: D9153303

Pulled By: sagar0

fbshipit-source-id: 9f7a8035429d0dcdbe99616d11389ed7bccf44be
2018-08-03 11:58:55 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Zhichao Cao
6811fb0658 Fixed the db_bench MergeRandom only access CF_default (#4155)
Summary:
When running the tracing and analyzing, I found that MergeRandom benchmark in db_bench only access the default column family even the -num_column_families is specified > 1.

changes: Using the db_with_cfh as DB to randomly select the column family to execute the Merge operation if -num_column_families is specified > 1.

Tested with make asan_check and verified in tracing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4155

Differential Revision: D8907888

Pulled By: zhichao-cao

fbshipit-source-id: 2b4bc8fe0e99c8f262f5be6b986c7025d62cf850
2018-07-20 15:58:54 -07:00
Siying Dong
a5e851e113 Reformatting some recent changes (#4161)
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161

Differential Revision: D8940582

Pulled By: siying

fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
2018-07-20 14:43:38 -07:00
Pooja Malik
1857576e03 db_bench support for OPTIONS+bloom and nicer output for perf_context (#4153)
Summary:
Adding the string "PERF_CONTEXT:" before the perf_context stats are printed. Setting the filter policy if it's a block based table even when options are being loaded from the provided FLAGS_options_file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4153

Differential Revision: D8905517

Pulled By: poojam23

fbshipit-source-id: 5956ed7882d39ec8ae654d5dadeb88727a36f0dd
2018-07-18 16:27:49 -07:00
Maysam Yabandeh
8581a93a6b Per-thread unique test db names (#4135)
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135

Differential Revision: D8846653

Pulled By: maysamyabandeh

fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
2018-07-13 17:27:39 -07:00
Zhongyi Xie
23b76252c8 db_bench: enable setting cache_size when loading options file
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4118

Differential Revision: D8845554

Pulled By: miasantreble

fbshipit-source-id: 13bd3c1259a7c30bad762a413fe3bb24eea650ba
2018-07-13 16:43:53 -07:00
Zhongyi Xie
de98fd88e3 Support compaction filter in db_bench (#4106)
Summary:
Right now there is no support for enabling compaction filter in db_bench, we should add support for that to facilitate testing of compaction filter.
This PR adds a compaction filter called KeepFilter and make `Filter` always returns false, essentially a noop compaction filter. This will allow us to test compaction filter code path without having to support arbitrary compaction filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4106

Differential Revision: D8828517

Pulled By: miasantreble

fbshipit-source-id: 9ad76d04103eaa9d00da98334b4a39e542d26c41
2018-07-12 19:42:27 -07:00
Andrew Kryczka
97fe23fc5c Fix unsigned int flag in db_bench (#4129)
Summary:
`DEFINE_uint32` was unavailable on some platforms, e.g., https://travis-ci.org/facebook/rocksdb/jobs/403352902. Use `DEFINE_uint64` instead which should work as it's used many times elsewhere in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4129

Differential Revision: D8830311

Pulled By: ajkr

fbshipit-source-id: b4fc90ba3f50e649c070ce8069c68e530d731f05
2018-07-12 18:43:23 -07:00
Andrew Kryczka
63904434eb db_bench periodically dump stats to info log (#4109)
Summary:
give control of how often stats are printed, including jemalloc stats if enabled. Previously the default was 10 minutes so we'd only see updated stats for very long benchmark runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4109

Differential Revision: D8796444

Pulled By: ajkr

fbshipit-source-id: fd7902fe3f105fae89322c4ab63316bba4a2b15e
2018-07-12 15:57:42 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00
Hans-Wilhelm Warlo
4faaab70a6 Benchmark sine wave write rate limit (#3914)
Summary:
As mentioned at the [dev forum.](https://www.facebook.com/groups/rocksdb.dev/1693425187422655/)

Let me know if you would like me to do any changes!
Closes https://github.com/facebook/rocksdb/pull/3914

Differential Revision: D8452824

Pulled By: siying

fbshipit-source-id: 56439b3228ecdcc5a199d5198eff2fab553be961
2018-06-15 12:12:03 -07:00
Zhongyi Xie
f1592a06c2 run make format for PR 3838 (#3954)
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954

Differential Revision: D8272041

Pulled By: miasantreble

fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
2018-06-05 12:58:02 -07:00
Maysam Yabandeh
d0c38c0c8c Extend some tests to format_version=3 (#3942)
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942

Differential Revision: D8238413

Pulled By: maysamyabandeh

fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
2018-06-04 20:13:00 -07:00
Dmitri Smirnov
f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00
Jacquin Mininger
727eb881a5 Compile error in db bench tool
Summary:
Small format error below causes build to fail. I believe that this :
```
fprintf(stderr, "num reads to do %lu\n", reads_);
```
Can be changed to this:
```
fprintf(stderr, "num reads to do %" PRIu64 "\n", reads_);
```
Successful build
```
  CC       utilities/blob_db/blob_dump_tool.o
  AR       librocksdb_debug.a
ar: creating archive librocksdb_debug.a
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: librocksdb_debug.a(rocks_lua_compaction_filter.o) has no symbols
  CC       tools/db_bench.o
  CC       tools/db_bench_tool.o
tools/db_bench_tool.cc:4532:46: error: format specifies type 'unsigned long' but the argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat]
    fprintf(stderr, "num reads to do %lu\n", reads_);
                                     ~~~     ^~~~~~
                                     %lld
1 error generated.
make: *** [tools/db_bench_tool.o] Error 1
```

```
$ cd rocksdb
$ make all

$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.1.0 (clang-902.0.39.1)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
Closes https://github.com/facebook/rocksdb/pull/3909

Differential Revision: D8215710

Pulled By: siying

fbshipit-source-id: 15e49fb02a818fec846e9f9b2a50e372b6b67751
2018-05-30 18:01:36 -07:00
Yi Wu
bc7e8d472e LRUCache midpoint insertion
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877

Differential Revision: D8100895

Pulled By: yiwu-arbug

fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
2018-05-24 15:57:33 -07:00
Andrew Kryczka
072ae671a7 Apply use_direct_io_for_flush_and_compaction to writes only
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
2018-05-09 19:42:58 -07:00
Zhongyi Xie
a703432808 MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction
Summary:
`MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used.
This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`)
Fixes https://github.com/facebook/rocksdb/issues/3229

In master:

```
Level Files Size(MB)
--------------------
  0       14      846
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5       15      366
  6       11      481
Cumulative compaction: 3.83 GB write, 2.27 GB read
```
In branch:
```
Level Files Size(MB)
--------------------
  0        9      544
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6      445      935
Cumulative compaction: 2.91 GB write, 1.46 GB read
```

db_bench command used:
```
./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2
```
Closes https://github.com/facebook/rocksdb/pull/3755

Differential Revision: D7721381

Pulled By: miasantreble

fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8
2018-05-03 16:42:13 -07:00
Zhongyi Xie
459bb9028f remove prefixscanrandom from db_bench help
Summary:
fix issue reported in https://github.com/facebook/rocksdb/issues/3757
Closes https://github.com/facebook/rocksdb/pull/3784

Differential Revision: D7794107

Pulled By: miasantreble

fbshipit-source-id: 43535074fcb82adb5656bcb916284b2dfc5cbb64
2018-04-27 12:13:19 -07:00
Gabriel Wicke
090c78a0d7 Support lowering CPU priority of background threads
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.

This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.

A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.

As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763

Differential Revision: D7740096

Pulled By: gwicke

fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
2018-04-24 08:41:51 -07:00
Yanqin Jin
2ee1496c43 Add missing whitespace.
Summary: Closes https://github.com/facebook/rocksdb/pull/3729

Differential Revision: D7645465

Pulled By: riversand963

fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d
2018-04-17 09:57:40 -07:00
Yi Wu
2c2f388897 db_bench fillXXXdeterministic should respect compression type
Summary:
db_bench fillXXXdeterministic should respect compression type when calling CompactFiles().
Closes https://github.com/facebook/rocksdb/pull/3731

Differential Revision: D7647761

Pulled By: yiwu-arbug

fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6
2018-04-16 18:01:47 -07:00
Zhongyi Xie
954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
Anand Ananthabhotla
f9f4d40f93 Align SST file data blocks to avoid spanning multiple pages
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502

Differential Revision: D7400897

Pulled By: anand1976

fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
2018-03-26 20:26:10 -07:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Pooya Shareghi
0a2354ca8f Added bytes XOR merge operator
Summary:
Closes https://github.com/facebook/rocksdb/pull/575

I fixed the merge conflicts etc.
Closes https://github.com/facebook/rocksdb/pull/3065

Differential Revision: D7128233

Pulled By: sagar0

fbshipit-source-id: 2c23a48c9f0432c290b0cd16a12fb691bb37820c
2018-03-06 10:27:36 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Andrew Kryczka
0a0fad447b db_bench separate options for partition index and filters
Summary:
Some workloads (like my current benchmarking) may want partitioned indexes without partitioned filters. Particularly, when `-optimize_filters_for_hits=true`, the total index size may be larger than the total filter size, so it can make sense to hold all filters in-memory but not all indexes.
Closes https://github.com/facebook/rocksdb/pull/3492

Differential Revision: D6970092

Pulled By: ajkr

fbshipit-source-id: b7fa1828e1d13829339aefb90fd56eb7c5337f61
2018-02-12 14:57:13 -08:00
Tamir Duberstein
cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Siying Dong
e2d4b0efb1 db_bench: sanity check CuckooTable with mmap_read option
Summary:
This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified.
Closes https://github.com/facebook/rocksdb/pull/3420

Differential Revision: D6838284

Pulled By: siying

fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341
2018-01-29 14:27:32 -08:00
Andrew Kryczka
9f7ccc8445 fix db_bench filluniquerandom key count assertion
Summary:
It failed every time. I guess people usually ran with assertions disabled.
Closes https://github.com/facebook/rocksdb/pull/3422

Differential Revision: D6822984

Pulled By: ajkr

fbshipit-source-id: 2e90db75618b26ac1c46ddfa9e03c095c7bf16e3
2018-01-29 11:43:21 -08:00
Andrew Kryczka
0e6e405fec db_bench support for memtable in-place update
Summary: Closes https://github.com/facebook/rocksdb/pull/3416

Differential Revision: D6820606

Pulled By: ajkr

fbshipit-source-id: 5035ffb33ade8d50520cafeb685ee8c8fcf1cca8
2018-01-26 10:57:49 -08:00
Anand Ananthabhotla
199405192d Add a BlockBasedTableOption to turn off index block compression.
Summary:
Add a new bool option index_uncompressed in BlockBasedTableOptions.
Closes https://github.com/facebook/rocksdb/pull/3303

Differential Revision: D6686161

Pulled By: anand1976

fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054
2018-01-10 15:11:59 -08:00
Yi Wu
46ec52499e Fix db_bench write being disabled in lite build
Summary:
The macro was added by mistake in #2372
Closes https://github.com/facebook/rocksdb/pull/3343

Differential Revision: D6681356

Pulled By: yiwu-arbug

fbshipit-source-id: 4180172fb0eaef4189c07f219241e0c261c03461
2018-01-09 10:57:29 -08:00
yingsu00
f54d7f5fea Port 3 way SSE4.2 crc32c implementation from Folly
Summary:
**# Summary**

RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.

This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.

**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.

Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.

1) ReadRandom in db_bench overall metrics

    PER RUN
    Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
    3-way      |  1   | 4.143   | 241387 | 26.7
    3-way      |  2   | 3.775   | 264872 | 29.3
    3-way      | 3    | 4.116   | 242929 | 26.9
    FastCrc32c|1  | 4.037   | 247727 | 27.4
    FastCrc32c|2  | 4.648   | 215166 | 23.8
    FastCrc32c|3  | 4.352   | 229799 | 25.4

     AVG
    Algorithm     |    Average of micros/op |   Average of ops/sec |    Average of Throughput (MB/s)
    3-way           |     4.01                               |      249,729                 |      27.63
    FastCrc32c  |     4.35                              |     230,897                  |      25.53

 2)   Crc32c computation CPU cost (inclusive samples percentage)
    PER RUN
    Implementation | run |  TotalSamples   | Crc32c percentage
    3-way                 |  1    |  4,572,250,000 | 4.37%
    3-way                 |  2    |  3,779,250,000 | 4.62%
    3-way                 |  3    |  4,129,500,000 | 4.48%
    FastCrc32c       |  1    |  4,663,500,000 | 11.24%
    FastCrc32c       |  2    |  4,047,500,000 | 12.34%
    FastCrc32c       |  3    |  4,366,750,000 | 11.68%

 **# Test Plan**
     make -j64 corruption_test && ./corruption_test
      By default it uses 3-way SSE algorithm

     NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test

    make clean && DEBUG_LEVEL=0 make -j64 db_bench
    make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173

Differential Revision: D6330882

Pulled By: yingsu00

fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
2017-12-19 18:26:49 -08:00
Sagar Vemuri
d51fcb21f4 Blob DB: Add db_bench options
Summary:
Adding more BlobDB db_bench options which are needed for benchmarking.
Closes https://github.com/facebook/rocksdb/pull/3230

Differential Revision: D6500711

Pulled By: sagar0

fbshipit-source-id: 91d63122905854ef7c9148a0235568719146e6c5
2017-12-06 20:44:12 -08:00
Andrew Kryczka
63f1c0a57d fix gflags namespace
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212

Differential Revision: D6456973

Pulled By: ajkr

fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
2017-12-01 10:42:05 -08:00
Sagar Vemuri
8954f830a0 Blob DB: db_bench flag to control BlobDB's garbage collection
Summary:
flag: blob_db_enable_gc, to control BlobDb's enable_garbage_collection.
Closes https://github.com/facebook/rocksdb/pull/3190

Differential Revision: D6383395

Pulled By: sagar0

fbshipit-source-id: 4134e835150748c425b8187264273a54c6d8381c
2017-11-20 23:26:15 -08:00
Andrew Kryczka
114896c4e0 db_bench compression options
Summary:
- moved existing compression options to `InitializeOptionsGeneral` since they cannot be set through options file
- added flag for `zstd_max_train_bytes` which was recently introduced by #3057
Closes https://github.com/facebook/rocksdb/pull/3128

Differential Revision: D6240460

Pulled By: ajkr

fbshipit-source-id: 27dbebd86a55de237ba6a45cc79cff9214e82ebc
2017-11-07 14:00:03 -08:00
Andrew Kryczka
65c95d9c59 support db_bench compact benchmark on bottommost files
Summary:
Without this option, running the compact benchmark on a DB containing only bottommost files simply returned immediately.
Closes https://github.com/facebook/rocksdb/pull/3138

Differential Revision: D6256660

Pulled By: ajkr

fbshipit-source-id: e3b64543acd503d821066f4200daa201d4fb3a9d
2017-11-07 10:57:24 -08:00
Dmitri Smirnov
ebab2e2d42 Enable MSVC W4 with a few exceptions. Fix warnings and bugs
Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
2017-10-19 10:57:12 -07:00
Andrew Kryczka
731895214b db_bench randomtransaction print throughput
Summary:
print throughput in MB/s upon finishing randomtransaction benchmark
Closes https://github.com/facebook/rocksdb/pull/3016

Differential Revision: D6070426

Pulled By: ajkr

fbshipit-source-id: 69df43beed4c374a36d826e761ca3a83e1fdcbf5
2017-10-16 18:42:25 -07:00
Andrew Kryczka
1026e794a3 rate limit auto-tuning
Summary:
Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:

- make rate limiter's `Env*` configurable for testing
- track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/2899

Differential Revision: D5858704

Pulled By: ajkr

fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
2017-10-04 19:15:01 -07:00
Andrew Kryczka
8fc3de3c62 make rate limiter a general option
Summary:
it's unsupported in options file, so the flag should be respected by db_bench even when an options file is provided.
Closes https://github.com/facebook/rocksdb/pull/2910

Differential Revision: D5869836

Pulled By: ajkr

fbshipit-source-id: f67f591ae083e95e989f86b6fad50765d2e3d855
2017-09-21 11:11:00 -07:00
Andrew Kryczka
74f18c1301 db_bench support for non-uniform column family ops
Summary:
Previously we could only select the CF on which to operate uniformly at random. This is a limitation, e.g., when testing universal compaction as all CFs would need to run full compaction at roughly the same time, which isn't realistic.

This PR allows the user to specify the probability distribution for selecting CFs via the `--column_family_distribution` argument.
Closes https://github.com/facebook/rocksdb/pull/2677

Differential Revision: D5544436

Pulled By: ajkr

fbshipit-source-id: 478d56260995236ae90895ce5bd51f38882e185a
2017-08-11 13:57:17 -07:00
Andrew Kryczka
dce6d5a838 db_bench background work thread pool size arguments
Summary:
The background thread pools' sizes weren't easily configurable by `max_background_compactions` and `max_background_flushes` in multi-instance setups. Introduced separate arguments for their sizes.
Closes https://github.com/facebook/rocksdb/pull/2680

Differential Revision: D5550675

Pulled By: ajkr

fbshipit-source-id: bab5f0a7bc5db63bb084d0c10facbe437096367d
2017-08-03 21:41:49 -07:00
Andrew Kryczka
cc01985db0 Introduce bottom-pri thread pool for large universal compactions
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.

This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.

- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580

Differential Revision: D5422916

Pulled By: ajkr

fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
2017-08-03 15:43:29 -07:00