Commit Graph

344 Commits

Author SHA1 Message Date
Peter Dillinger
4149d044cd Change SstFileMetaData::size from size_t to uint64_t (#8926)
Summary:
Because even 32-bit systems can have large files

This is a "change" that I don't want intermingled with an upcoming refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D31020974

Pulled By: pdillinger

fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
2021-09-17 13:23:34 -07:00
mrambacher
dc0dc90cf5 Make Statistics a Customizable Class (#8637)
Summary:
Make the Statistics object into a Customizable object.  Statistics can now be stored and created to/from the Options file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637

Reviewed By: zhichao-cao

Differential Revision: D30530550

Pulled By: mrambacher

fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd
2021-09-10 09:47:39 -07:00
mrambacher
beed86473a Make MemTableRepFactory into a Customizable class (#8419)
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API

New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419

Reviewed By: zhichao-cao

Differential Revision: D29558961

Pulled By: mrambacher

fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
2021-09-08 07:46:44 -07:00
Merlin Mao
d10801e983 Allow Replayer to report the results of TraceRecords. (#8657)
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
2021-08-18 17:06:14 -07:00
Merlin Mao
f58d276764 Make TraceRecord and Replayer public (#8611)
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
2021-08-11 19:32:46 -07:00
Baptiste Lemaire
e3a96c4823 Memtable sampling for mempurge heuristic. (#8628)
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
2021-08-10 18:09:03 -07:00
Andrew Kryczka
82b81dc8b5 Simplify GenericRateLimiter algorithm (#8602)
Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately.  Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped.  These threads are
responsible for unblocking themselves.  The work to do so is split into two main
duties.

(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.

Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2).  It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted.  This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted.  However, the conceptual
complexity of this algorithm was too high.  It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.

This PR drops the leader election to reduce conceptual complexity.  Now, the two
duties can be performed by whichever thread acquires the lock first.  The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.

- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).

Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue.  This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.

Note there are a few cases that still lead to intermediate wakeup, however.  The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.

- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
  - A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
  - A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602

Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
  without this change, even in a rate limiter heavy workload:

Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```

Results before this PR:

```
fillrandom   :     108.463 micros/op 9219 ops/sec;    9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```

Results after this PR:

```
fillrandom   :     108.108 micros/op 9250 ops/sec;    9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```

Reviewed By: hx235

Differential Revision: D30048013

Pulled By: ajkr

fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
2021-08-09 16:47:15 -07:00
sdong
e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
mrambacher
d057e8326d Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
2021-08-06 08:27:25 -07:00
Baptiste Lemaire
9501279d5f Create fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593)
Summary:
Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times:
"A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible."
New flags:
- `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry.
- `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence.
- `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries.
- `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts.
- `persistent_entries_value_size`: size of the random value string for the `persistent` entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593

Reviewed By: pdillinger

Differential Revision: D29974436

Pulled By: bjlemaire

fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d
2021-07-29 17:23:01 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
Baptiste Lemaire
4361d6d163 Add simple heuristics for experimental mempurge. (#8583)
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583

Reviewed By: pdillinger

Differential Revision: D29888050

Pulled By: bjlemaire

fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
2021-07-26 11:56:29 -07:00
leipeng
2febf1c45c db_bench_tool.cc: fix copy - paste (#8553)
Summary:
PR https://github.com/facebook/rocksdb/issues/8519 fix db_bench_tool.cc for MSVC build errors by simply copy-paste, this PR fix the copy-paste while also works for MSVC.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8553

Reviewed By: ajkr

Differential Revision: D29838056

Pulled By: jay-zhuang

fbshipit-source-id: 0cd60c146b87a355c3dc1061dfe813169d75cea4
2021-07-23 14:31:29 -07:00
Baptiste Lemaire
6b4cdacf41 Add overwrite_probability for filluniquerandom benchmark in db_bench (#8569)
Summary:
Add flags `overwrite_probability` and `overwrite_window_size` flag to `db_bench`.
Add the possibility of performing a `filluniquerandom` benchmark with an overwrite probability.
For each write operation, there is a probability _p_ that the write is an overwrite (_p_=`overwrite_probability`).
When an overwrite is decided, the key is randomly chosen from the last _N_ keys previously inserted into the DB (with _N_=`overwrite_window_size`).
When a pure write is decided, the key inserted into the DB is unique and therefore will not be an overwrite.
The `overwrite_window_size` is used so that the user can decide if the overwrite are mostly targeting recently inserted keys (when `overwrite_window_size` is small compared to the total number of writes), or can also target keys inserted "a long time ago" (when `overwrite_window_size` is comparable to total number of writes).
Note that total number of writes = # of unique insertions + # of overwrites.
No unit test specifically added.
Local testing show the following **throughputs** for `filluniquerandom` with 1M total writes:
- bypass the code inserts (no `overwrite_probability` flag specified): ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=10`: ~17.0MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=10`: ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=1M`: ~14.5MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=1M`: ~14.0MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8569

Reviewed By: pdillinger

Differential Revision: D29818631

Pulled By: bjlemaire

fbshipit-source-id: d472b4ea4e457a4da7c4ee4f14b40cccd6a4587a
2021-07-21 11:33:33 -07:00
sdong
bbc85a5f22 Fix minor wrong variable name in db_bench (#8549)
Summary:
Fix a minor variable name that is not accurate. This is recently introduced in https://github.com/facebook/rocksdb/pull/7818

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8549

Reviewed By: zhichao-cao

Differential Revision: D29745585

fbshipit-source-id: 6268b348878fdf99a162b2cc3d5876fbd9bb10d9
2021-07-19 17:08:15 -07:00
Baptiste Lemaire
f4529a54bb Add experimental_allow_mempurge flag to benchmark. (#8546)
Summary:
Tiny PR to add the `experimental_allow_mempurge` to the `db_bench` tool (`Mempurge` is the current prototype for memtable garbage collection).
This is useful to benchmark the prototype of this new feature, stress test it and help find new meaningful heuristics for GC.
By default, the flag to allow `mempurge` is set to `false`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8546

Reviewed By: anand1976

Differential Revision: D29738338

Pulled By: bjlemaire

fbshipit-source-id: 01892883a2f1c714c110718674da05992d6e2dd6
2021-07-19 11:19:21 -07:00
sdong
1e5b631e51 db_bench seekrandom with multiDB should only create iterators queried (#7818)
Summary:
Right now, db_bench with seekrandom and multiple DB setup creates iterator for all DBs just to query one of them. It's different from most real workloads. Fix it by only creating iterators that will be queried.

Also fix a bug that DBs are not destroyed in multi-DB mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7818

Test Plan: Run db_bench with single/multiDB X using/not using tailing iterator with ASAN build, and validate the behavior is expected.

Reviewed By: ajkr

Differential Revision: D25720226

fbshipit-source-id: c2ff7ff7120e5ba64287a30b057c5d29b2cbe20b
2021-07-16 12:28:10 -07:00
sherriiiliu
7b9ecd4067 fix several MSVC build errors (#8519)
Summary:
Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings
* `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define)
* `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead
* Define precise type in comparison to avoid mismatch warning

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519

Reviewed By: jay-zhuang

Differential Revision: D29683086

fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4
2021-07-13 12:40:43 -07:00
mrambacher
da90e23998 Improvements to benchmark.sh script (#8346)
Summary:
1.  Fix printing of stats when there are no writes (wamp=0).  Previously had a div0 error

2.  Added multireadrandom command as a valid target

3.  Added ability to pass additional command line options to db_bench.  Now can say things like benchmark.sh readrandom --mmap_read and the option will be passed to db_bench.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8346

Reviewed By: zhichao-cao

Differential Revision: D29500436

Pulled By: mrambacher

fbshipit-source-id: 54e90708aae9133be3a903e35efdf8f8abbd86fa
2021-07-12 12:18:17 -07:00
Adam Retter
5afd1e309c Correct CVS -> CSV typo (#8513)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8513

Reviewed By: jay-zhuang

Differential Revision: D29654066

Pulled By: mrambacher

fbshipit-source-id: b8f492fe21edd37fe1f1c5a4a0e9153f58bbf3e2
2021-07-12 05:05:16 -07:00
mrambacher
570248aeff Make SecondaryCache Customizable (#8480)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480

Reviewed By: zhichao-cao

Differential Revision: D29528740

Pulled By: mrambacher

fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839
2021-07-06 09:18:08 -07:00
Peter (Stig) Edwards
b20737709f Add -report_open_timing to db_bench (#8464)
Summary:
Hello and thanks for RocksDB,

This PR adds support for ```-report_open_timing true``` to ```db_bench```.
It can be useful when tuning RocksDB on filesystem/env with high latencies for file level operations (create/delete/rename...) seen during ```((Optimistic)Transaction)DB::Open```.

Some examples:

```
> db_bench -benchmarks updaterandom -num 1 -db /dev/shm/db_bench
> db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true -readonly true 2>&1 | grep OpenDb
OpenDb:     3.90133 milliseconds
> db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 | grep OpenDb
OpenDb:     3.33414 milliseconds
> db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true 2>&1 | grep -A1 OpenDb
OpenDb:     6.05423 milliseconds

> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true -readonly true 2>&1 | grep OpenDb
OpenDb:     4.06859 milliseconds
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 | grep OpenDb
OpenDb:     2.85794 milliseconds
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true 2>&1 | grep OpenDb
OpenDb:     6.46376 milliseconds

> db_bench -benchmarks updaterandom -num 1 -db /clustered_fs/db_bench
> db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true -readonly true 2>&1 | grep OpenDb
OpenDb:     3.79805 milliseconds
> db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 | grep OpenDb
OpenDb:     3.00174 milliseconds
> db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true 2>&1 | grep OpenDb
OpenDb:     24.8732 milliseconds
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8464

Reviewed By: hx235

Differential Revision: D29398096

Pulled By: zhichao-cao

fbshipit-source-id: 8f05dc3284f084612a3f30234e39e1c37548f50c
2021-07-01 18:42:19 -07:00
Akanksha Mahajan
be8199cdb9 Run Merge with Integrated BlobDB in stress, crash and db_bench (#8461)
Summary:
Run Merge with Intergrated BlobDB in stress tests, crash tests and db_bench.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8461

Test Plan:
1. python3 -u tools/db_crashtest.py --simple whitebox
---use_merge=1 --enable_blob_files=1
           2.  ./db_bench --benchmarks="readwhilemerging"
--merge_operator=uint64add --enable_blob_files=true

Reviewed By: ltamasi

Differential Revision: D29394824

Pulled By: akankshamahajan15

fbshipit-source-id: 0a8e492b13129673e088fb8af3402ab678bb473a
2021-06-25 10:45:52 -07:00
Peter (Stig) Edwards
75741eb0ce Add more ops to: db_bench -report_file_operations (#8448)
Summary:
Hello and thanks for RocksDB,

Here is a PR to add file deletes, renames and ```Flush()```, ```Sync()```, ```Fsync()``` and ```Close()``` to file ops report.

The reason is to help tune RocksDB options when using an env/filesystem with high latencies for file level ("metadata") operations, typically seen during ```DB::Open``` (```db_bench -num 0``` also see https://github.com/facebook/rocksdb/pull/7203 where IOTracing does not trace ```DB::Open```).

Before:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries:    0
...
Num files opened: 12
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```
After:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries:    0
...
Num files opened: 12
Num files deleted: 3
Num files renamed: 4
Num Flush(): 10
Num Sync(): 5
Num Fsync(): 1
Num Close(): 2
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```

Before:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries:    1000000
...
Num files opened: 18
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```
After:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries:    1000000
...
Num files opened: 18
Num files deleted: 5
Num files renamed: 4
Num Flush(): 1000068
Num Sync(): 9
Num Fsync(): 1
Num Close(): 6
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```

Another example showing how using ```DB::OpenForReadOnly``` reduces file operations compared to ```((Optimistic)Transaction)DB::Open```:

```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -readonly true -report_file_operations true
...
Entries:    0
...
Num files opened: 8
Num files deleted: 0
Num files renamed: 0
Num Flush(): 0
Num Sync(): 0
Num Fsync(): 0
Num Close(): 0
Num Read(): 13
Num Append(): 0
Num bytes read: 374
Num bytes written: 0
```

```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_file_operations true
...
Entries:    0
...
Num files opened: 14
Num files deleted: 3
Num files renamed: 4
Num Flush(): 14
Num Sync(): 5
Num Fsync(): 1
Num Close(): 3
Num Read(): 11
Num Append(): 10
Num bytes read: 7291
Num bytes written: 7357
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8448

Reviewed By: anand1976

Differential Revision: D29333818

Pulled By: zhichao-cao

fbshipit-source-id: a06a8c87f799806462319115195b3e94faf5f542
2021-06-24 11:56:51 -07:00
Akanksha Mahajan
5ba1b6e549 Cache warming data blocks during flush (#8242)
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.

 Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
 of blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242

Test Plan: Add new unit test

Reviewed By: anand1976

Differential Revision: D28521703

Pulled By: akankshamahajan15

fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
2021-06-17 21:56:47 -07:00
Peter Dillinger
865a25101d Mark Ribbon filter and optimize_filters_for_memory as production (#8408)
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.

Some small refactoring in db_stress.

Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408

Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:

bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384

ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384

Reviewed By: mrambacher

Differential Revision: D29140805

Pulled By: pdillinger

fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
2021-06-17 12:29:16 -07:00
mrambacher
281ac9c89e Add CreateFrom methods to Env/FileSystem (#8174)
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load.  This method/signature is a precursor to making these classes extend Customizable.

- Added CreateFromSystem to Env.  This method standardizes creating an Env from the environment variables.  Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI.  Now the code is more command/standardized.

- Added CreateFromFlags to Env.  These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174

Reviewed By: zhichao-cao

Differential Revision: D28999603

Pulled By: mrambacher

fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
2021-06-15 03:43:48 -07:00
Peter Dillinger
311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
anand76
13232e11d4 Allow cache_bench/db_bench to use a custom secondary cache (#8312)
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.

The main cache_bench code is packaged into a separate library, similar to db_bench.

An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true  -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312

Reviewed By: zhichao-cao

Differential Revision: D28544325

Pulled By: anand1976

fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
2021-05-19 15:26:18 -07:00
sdong
c3ff14e2c1 Hint temperature of bottommost level files to FileSystem (#8222)
Summary:
As the first part of the effort of having placing different files on different storage types, this change introduces several things:
(1) An experimental interface in FileSystem that specify temperature to a new file created.
(2) A test FileSystemWrapper,  SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
(3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
(4) A db_bench parameter that applies the (2) and (3) to db_bench.

The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222

Test Plan:
./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000  -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000

followed by

./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000

and see results as expected.

Reviewed By: ajkr

Differential Revision: D28003028

fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
2021-05-03 13:34:04 -07:00
David Carlier
728e5f5750 db_bench_tool: basic sys infos for FreeBSD. (#8169)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8169

Reviewed By: riversand963

Differential Revision: D27672457

Pulled By: ajkr

fbshipit-source-id: b40a7ad5d09a754154f28c2574ef9f77c8a131bb
2021-04-09 10:37:01 -07:00
Yanqin Jin
2d8518f5ea Reset pinnable slice before using it in Get() (#8154)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6548.
If we do not reset the pinnable slice before calling get, we will see the following assertion failure
while running the test with multiple column families.
```
db_bench: ./include/rocksdb/slice.h:168: void rocksdb::PinnableSlice::PinSlice(const rocksdb::Slice&, rocksdb::Cleanable*): Assertion `!pinned_' failed.
```
This happens in `BlockBasedTable::Get()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8154

Test Plan:
./db_bench --benchmarks=fillseq -num_column_families=3
./db_bench --benchmarks=readrandom -use_existing_db=1 -num_column_families=3

Reviewed By: ajkr

Differential Revision: D27587589

Pulled By: riversand963

fbshipit-source-id: 7379e7649ba40f046d6a4014c9ad629cb3f9a786
2021-04-06 11:31:17 -07:00
mrambacher
1be3867689 Fix check in db_bench for num shard bits to match check in LRUCache (#8110)
Summary:
The check in db_bench for table_cache_numshardbits was 0 < bits <= 20, whereas the check in LRUCache was 0 < bits < 20.  Changed the two values to match to avoid a crash in db_bench on a null cache.

Fixes https://github.com/facebook/rocksdb/issues/7393

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8110

Reviewed By: zhichao-cao

Differential Revision: D27353522

Pulled By: mrambacher

fbshipit-source-id: a414bd23b5bde1f071146b34cfca5e35c02de869
2021-03-29 10:34:54 -07:00
junhan lee
06bb45a65a fix typo (#8088)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8088

Reviewed By: ajkr

Differential Revision: D27270378

Pulled By: zhichao-cao

fbshipit-source-id: 05af12c63855d00cc57bab9866fc8193c03a404e
2021-03-26 11:49:32 -07:00
Zhichao Cao
dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Mark Callaghan
326670d265 Add new db_bench --benchmarks options for controlling compaction (#8027)
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish

These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027

Reviewed By: riversand963

Differential Revision: D27053861

Pulled By: ajkr

fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
2021-03-17 09:12:27 -07:00
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Yanqin Jin
1f11d07f24 Enable compact filter for blob in dbstress and dbbench (#8011)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8011

Test Plan:
```
./db_bench -enable_blob_files=1 -use_keep_filter=1 -disable_auto_compactions=1
/db_stress -enable_blob_files=1 -enable_compaction_filter=1 -acquire_snapshot_one_in=0 -compact_range_one_in=0 -iterpercent=0 -test_batches_snapshots=0 -readpercent=10 -prefixpercent=20 -writepercent=55 -delpercent=15 -continuous_verification_interval=0
```

Reviewed By: ltamasi

Differential Revision: D26736061

Pulled By: riversand963

fbshipit-source-id: 1c7834903c28431ce23324c4f259ed71255614e2
2021-03-01 17:24:47 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
Levi Tamasi
0743eba0c4 Add support for the integrated BlobDB to db_bench (#7956)
Summary:
The patch adds the configuration options of the new BlobDB implementation
to `db_bench` and adjusts the help messages of the old (`StackableDB`-based)
BlobDB's options to make it clear which implementation they pertain to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7956

Test Plan: Ran `make check` and `db_bench` with the new options.

Reviewed By: jay-zhuang

Differential Revision: D26384808

Pulled By: ltamasi

fbshipit-source-id: b4405bb2c56cfd3506d4c32e3329c08dfdf69c94
2021-02-17 11:10:18 -08:00
David CARLIER
14fbb43f3e db_bench: dump cpu info for Mac. (#7932)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7932

Reviewed By: jay-zhuang

Differential Revision: D26316480

Pulled By: zhichao-cao

fbshipit-source-id: 3e002e49fcb7f60bc9270550a6b3e182fe197551
2021-02-10 12:56:44 -08:00
anand76
4ee991b1e6 Cleanup multiple DBs after running db_bench in multi-DB mode (#7891)
Summary:
Currently, db_bench cleanup only deletes the main DB, if there's one.
Multiple DBs that are opened when --num_multi_db is specified are not
deleted, which can lead to crashes due to running compaction threads on
process exit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7891

Test Plan: Run regression test

Reviewed By: jay-zhuang

Differential Revision: D26049914

Pulled By: anand1976

fbshipit-source-id: acef2821001ca5e208a96a6a273c724e56353316
2021-01-26 11:12:22 -08:00
anand76
d7738666b0 Fix db_bench duration for multireadrandom benchmark (#7817)
Summary:
The multireadrandom benchmark, when run for a specific number of reads (--reads argument), should base the duration on the actual number of keys read rather than number of batches.

Tests:
Run db_bench multireadrandom benchmark

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7817

Reviewed By: zhichao-cao

Differential Revision: D25717230

Pulled By: anand1976

fbshipit-source-id: 13f4d8162268cf9a34918655e60302d0aba3864b
2020-12-28 13:38:10 -08:00
Peter Dillinger
239d17a19c Support optimize_filters_for_memory for Ribbon filter (#7774)
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.

This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.

Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.

optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)

So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774

Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)

Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.

Bloom+optimize_filters_for_memory:

      1 Filter size: 197 (224 in memory)
    134 Filter size: 3525 (3584 in memory)
    107 Filter size: 4037 (4096 in memory)
    Total on disk: 904,506
    Total in memory: 918,752

Ribbon+optimize_filters_for_memory:

      1 Filter size: 3061 (3072 in memory)
    110 Filter size: 3573 (3584 in memory)
     58 Filter size: 4085 (4096 in memory)
    Total on disk: 633,021 (-30.0%)
    Total in memory: 634,880 (-30.9%)

Bloom (no offm):

      1 Filter size: 261 (320 in memory)
      1 Filter size: 3333 (3584 in memory)
    240 Filter size: 3717 (4096 in memory)
    Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
    Total in memory: 986,944 (+7.4% vs. +offm)

Ribbon (no offm):

      1 Filter size: 2949 (3072 in memory)
      1 Filter size: 3381 (3584 in memory)
    167 Filter size: 3701 (4096 in memory)
    Total on disk: 624,397 (-30.3% vs. Bloom)
    Total in memory: 690,688 (-30.0% vs. Bloom)

Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.

Reviewed By: jay-zhuang

Differential Revision: D25592970

Pulled By: pdillinger

fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
2020-12-18 14:31:03 -08:00
Mammo, Mulugeta
1861de455e Add arena_block_size flag to db_bench (#7654)
Summary:
db_bench currently does not allow overriding the default `arena_block_size `calculation ([memtable size/8](https://github.com/facebook/rocksdb/blob/master/db/column_family.cc#L216)). For memtables whose size is in gigabytes, the `arena_block_size` defaults to hundreds of megabytes (affecting performance).

Exposing this option in db_bench would allow us to test the workloads with various `arena_block_size` values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7654

Reviewed By: jay-zhuang

Differential Revision: D24996812

Pulled By: ajkr

fbshipit-source-id: a5e3d2c83d9f89e1bb8382f2e8dd476c79e33bef
2020-11-16 13:06:30 -08:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
Levi Tamasi
30fb9dd50f Introduce a helper method UncompressData (#7434)
Summary:
The patch introduces a helper method in `util/compression.h` called `UncompressData`
that dispatches calls to the correct uncompression method based on type, and changes
`UncompressBlockContentsForCompressionType` and `Benchmark::Uncompress` in
`db_bench` so they are implemented in terms of the new method. This eliminates
some code duplication. (`Benchmark::Compress` is also updated to use the previously
introduced `CompressData` helper.)

In addition, the patch brings the implementation of `Snappy_Uncompress` into sync with
the other uncompression methods by making the method compute the buffer size and allocate
the buffer itself. Finally, the patch eliminates some potentially risky back-and-forth conversions
between various unsigned and signed integer types by exposing the size of the allocated buffer
as a `size_t` instead of an `int`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7434

Test Plan:
`make check`
`./db_bench -benchmarks=compress,uncompress --compression_type ...`

Reviewed By: riversand963

Differential Revision: D23900011

Pulled By: ltamasi

fbshipit-source-id: b25df63ceec4639889be94acb22eb53e530c54e0
2020-09-25 09:01:45 -07:00
Yanqin Jin
a28df7a75a Add basic support for user-defined timestamp to db_bench (#7389)
Summary:
Update db_bench so that we can run it with user-defined timestamp.
Currently, only 64-bit timestamp is supported, while others are disabled by assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7389

Test Plan: ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readsequential,....., -user_timestamp_size=8

Reviewed By: ltamasi

Differential Revision: D23720830

Pulled By: riversand963

fbshipit-source-id: 486eacbb82de9a5441e79a61bfa9beef6581608a
2020-09-15 20:34:26 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
Peter Dillinger
9de912de3f Fix some errors showing up in Travis builds (#7359)
Summary:
Also enables a pull request to trigger all the Travis
configurations by writing FULL_CI in the commit message. (See what I did
there?)

First issue

    make: *** No rule to make target 'jl/util/crc32c_ppc_asm.o', needed by 'rocksdbjava'.  Stop.

Second issue

    tools/db_bench_tool.cc:5514:38: error: ‘gen_exp.rocksdb::Benchmark::GenerateTwoTermExpKeys::keyrange_size_’ may be used uninitialized in this function

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7359

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D23582132

Pulled By: pdillinger

fbshipit-source-id: 06d794673fd522ba11cf6398385387e6bd97ef89
2020-09-08 15:11:47 -07:00
Hans Holmberg
679a413f11 Close databases on benchmark error exits in db_bench (#7327)
Summary:
Delete database instances to make sure there are no loose threads
running before exit(). This fixes segfaults seen when running
workloads through CompositeEnvs with custom file systems.

For further background on the issues arising when using CompositeEnvs, see the discussion in:
https://github.com/facebook/rocksdb/pull/6878

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7327

Reviewed By: cheng-chang

Differential Revision: D23433244

Pulled By: ajkr

fbshipit-source-id: 4e19cf2067e3fe68c2a3fe1823f24b4091336bbe
2020-09-03 14:36:30 -07:00
Hans Holmberg
2a0d3c7054 Add a file system parameter: --fs_uri to db_stress and db_bench (#6878)
Summary:
This pull request adds the parameter --fs_uri to db_bench and db_stress, creating a composite env combining the default env with a specified registered rocksdb file system.

This makes it easier to develop and test new RocksDB FileSystems.

The pull request also registers the posix file system for testing purposes.

Examples:
```
$./db_bench --fs_uri=posix:// --benchmarks=fillseq

$./db_stress --fs_uri=zenfs://nullb1
```

zenfs is a RocksDB FileSystem I'm developing to add support for zoned block devices, and in that case the zoned block device is specified in the uri (a zoned null block device in the above example).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6878

Reviewed By: siying

Differential Revision: D23023063

Pulled By: ajkr

fbshipit-source-id: 8b3fe7193ce45e683043b021779b7a4d547af247
2020-08-17 11:55:24 -07:00
Aaron Kabcenell
56ed601df3 Compaction Read/Write Stats by Compaction Type (#7165)
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165

Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:

/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1

Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.

Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking

Reviewed By: ajkr

Differential Revision: D22693050

Pulled By: akabcenell

fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
2020-07-29 13:39:29 -07:00
Peter Dillinger
5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Peter Dillinger
c7432cc3c0 Fix more defects reported by Coverity Scan (#6935)
Summary:
Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935

Test Plan: make check

Reviewed By: siying

Differential Revision: D21885484

Pulled By: pdillinger

fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08
2020-06-04 15:35:08 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Mian Qin
d9e170d82b Fix issues for reproducing synthetic ZippyDB workloads in the FAST20' paper (#6795)
Summary:
Fix issues for reproducing synthetic ZippyDB workloads in the FAST20' paper using db_bench. Details changes as follows.
1, add a separate random mode in MixGraph to produce all_random workload.
2, fix power inverse function for generating prefix_dist workload.
3, make sure key_offset in prefix mode is always unsigned.
note: Need to carefully choose key_dist_a/b to avoid aliasing. Power inverse function range should be close to overall key space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6795

Reviewed By: akankshamahajan15

Differential Revision: D21371095

Pulled By: zhichao-cao

fbshipit-source-id: 80744381e242392c8c7cf8ac3d68fe67fe876048
2020-05-04 10:55:14 -07:00
Ziyue Yang
e619a20e93 Add an option for parallel compression in for db_stress (#6722)
Summary:
This commit adds an `compression_parallel_threads` option in
db_stress. It also fixes the naming of parallel compression
option in db_bench to keep it aligned with others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6722

Reviewed By: pdillinger

Differential Revision: D21091385

fbshipit-source-id: c9ba8c4e5cc327ff9e6094a6dc6a15fcff70f100
2020-04-30 10:49:07 -07:00
Derrick Pallas
5272305437 Fix FilterBench when RTTI=0 (#6732)
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti.  Replace with static_cast_with_check.

Signed-off-by: Derrick Pallas <derrick@pallas.us>

Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732

Reviewed By: ltamasi

Differential Revision: D21304260

Pulled By: pdillinger

fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
2020-04-29 13:09:23 -07:00
Peter Dillinger
31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
sdong
1be3be5522 Auto-Format two recent diffs and add HISTORY.md (#6685)
Summary:
Two recent diffs can be autoformatted.
Also add HISTORY.md entry for https://github.com/facebook/rocksdb/pull/6214
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6685

Test Plan: Run all existing tests

Reviewed By: cheng-chang

Differential Revision: D20965780

fbshipit-source-id: 195b08d7849513d42fe14073112cd19fdda6af95
2020-04-10 11:32:44 -07:00
Luca Giacchino
66a95f0fac Provide an allocator for new memory type to be used with RocksDB block cache (#6214)
Summary:
New memory technologies are being developed by various hardware vendors (Intel DCPMM is one such technology currently available). These new memory types require different libraries for allocation and management (such as PMDK and memkind). The high capacities available make it possible to provision large caches (up to several TBs in size), beyond what is achievable with DRAM.
The new allocator provided in this PR uses the memkind library to allocate memory on different media.

**Performance**

We tested the new allocator using db_bench.
- For each test, we vary the size of the block cache (relative to the size of the uncompressed data in the database).
- The database is filled sequentially. Throughput is then measured with a readrandom benchmark.
- We use a uniform distribution as a worst-case scenario.

The plot shows throughput (ops/s) relative to a configuration with no block cache and default allocator.
For all tests, p99 latency is below 500 us.

![image](https://user-images.githubusercontent.com/26400080/71108594-42479100-2178-11ea-8231-8a775bbc92db.png)

**Changes**

- Add MemkindKmemAllocator
- Add --use_cache_memkind_kmem_allocator db_bench option (to create an LRU block cache with the new allocator)
- Add detection of memkind library with KMEM DAX support
- Add test for MemkindKmemAllocator

**Minimum Requirements**

- kernel 5.3.12
- ndctl v67 - https://github.com/pmem/ndctl
- memkind v1.10.0 - https://github.com/memkind/memkind

**Memory Configuration**

The allocator uses the MEMKIND_DAX_KMEM memory kind. Follow the instructions on[ memkind’s GitHub page](https://github.com/memkind/memkind) to set up NVDIMM memory accordingly.

Note on memory allocation with NVDIMM memory exposed as system memory.
- The MemkindKmemAllocator will only allocate from NVDIMM memory (using memkind_malloc with MEMKIND_DAX_KMEM kind).
- The default allocator is not restricted to RAM by default. Based on NUMA node latency, the kernel should allocate from local RAM preferentially, but it’s a kernel decision. numactl --preferred/--membind can be used to allocate preferentially/exclusively from the local RAM node.

**Usage**

When creating an LRU cache, pass a MemkindKmemAllocator object as argument.
For example (replace capacity with the desired value in bytes):

```
#include "rocksdb/cache.h"
#include "memory/memkind_kmem_allocator.h"

NewLRUCache(
    capacity /*size_t*/,
    6 /*cache_numshardbits*/,
    false /*strict_capacity_limit*/,
    false /*cache_high_pri_pool_ratio*/,
    std::make_shared<MemkindKmemAllocator>());
```

Refer to [RocksDB’s block cache documentation](https://github.com/facebook/rocksdb/wiki/Block-Cache) to assign the LRU cache as block cache for a database.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6214

Reviewed By: cheng-chang

Differential Revision: D19292435

fbshipit-source-id: 7202f47b769e7722b539c86c2ffd669f64d7b4e1
2020-04-09 20:47:23 -07:00
CaixinGong
a91613dd06 Fix readrandom return NotFound after fillrandom in db_bench (#6665)
Summary:
This commit is fixing a bug that readrandom test returns many NotFound in db_bench from Version 6.2.
Pull Request resolved: https://github.com/facebook/rocksdb/issues/6664
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6665

Reviewed By: cheng-chang

Differential Revision: D20911298

Pulled By: ajkr

fbshipit-source-id: c2658d4dbb35798ccbf67dff6e64923fb731ef81
2020-04-08 14:27:12 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
sdong
488b1e6739 Fix an error in db_bench with gcc 4.8 (#6537)
Summary:
I start to see following failures:

tools/db_bench_tool.cc: In constructor ‘rocksdb::NormalDistribution::NormalDistribution(unsigned int, unsigned int)’:
tools/db_bench_tool.cc:1528:58: error: declaration of ‘max’ shadows a member of 'this' [-Werror=shadow]
   NormalDistribution(unsigned int min, unsigned int max) :
                                                          ^
tools/db_bench_tool.cc:1528:58: error: declaration of ‘min’ shadows a member of 'this' [-Werror=shadow]
tools/db_bench_tool.cc: In constructor ‘rocksdb::UniformDistribution::UniformDistribution(unsigned int, unsigned int)’:
tools/db_bench_tool.cc:1546:59: error: declaration of ‘max’ shadows a member of 'this' [-Werror=shadow]
   UniformDistribution(unsigned int min, unsigned int max) :
                                                           ^
tools/db_bench_tool.cc:1546:59: error: declaration of ‘min’ shadows a member of 'this' [-Werror=shadow]

when I build from GCC 4.8. Rename those variables to fix the problem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6537

Test Plan: make all with the compiler that used to show the failure.

Differential Revision: D20448741

fbshipit-source-id: 18bcf012dbe020f22f79038a9b08f447befa2574
2020-03-16 13:50:40 -07:00
Levi Tamasi
8637bc1eea Fix the description of unordered_write in db_bench (#6476)
Summary:
As reported in https://github.com/facebook/rocksdb/issues/6467, the
description of the `unordered_write` switch of `db_bench` was incorrect.
(Note: the new description is based on
https://rocksdb.org/blog/2019/08/15/unordered-write.html).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6476

Test Plan: `db_bench --help`

Differential Revision: D20200653

Pulled By: ltamasi

fbshipit-source-id: 4c3683fcfa6a069164167af5aaff9974a810c16a
2020-03-02 15:34:19 -08:00
sdong
9b3c9ef0e8 Add --index_with_first_key and --index_shortening_mode to DB bench (#5859)
Summary:
Some combinatino of --index_with_first_key and --index_shortening_mode can signifcantly improve performance for large values. Expose them in db_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5859

Test Plan: Run them with the new options and observe the behavior.

Differential Revision: D20104434

fbshipit-source-id: 21d48a732a9caf20b82312c7d7557d747ea3c304
2020-03-02 11:55:28 -08:00
Michael R. Crusoe
051696bf98 fix some spelling typos (#6464)
Summary:
Found from Debian's "Lintian" program
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464

Differential Revision: D20162862

Pulled By: zhichao-cao

fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12
2020-02-28 14:14:03 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
sdong
df3f33dd05 Fix db_bench LITE build recently broken (#6411)
Summary:
A recent change https://github.com/facebook/rocksdb/pull/6386 broke LITE build in a trivial way. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6411

Test Plan: Run "LITE=1 make all"

Differential Revision: D19871765

fbshipit-source-id: 74f0ad3f8a9d666fbde0da7fd29ba1547a811f77
2020-02-13 10:52:50 -08:00
Burton Li
e64508917b db_bench supports for generating random variable sized value. (#6386)
Summary:
1. `db_bench` now supports `value_size_distribution_type`, `value_size_min`, `value_size_max` options for generating random variable sized value.
2. Added `blob_db_compression_type` option for BlobDB to enable blob compression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6386

Differential Revision: D19859406

Pulled By: zhichao-cao

fbshipit-source-id: ace52674090023fde15d832392110bf288a8e215
2020-02-12 14:47:03 -08:00
sdong
876c2dbff4 Allow readahead when reading option files. (#6372)
Summary:
Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372

Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.

Differential Revision: D19727739

fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
2020-02-07 15:18:26 -08:00
sdong
24c9dce825 Remove include math.h (#6373)
Summary:
We see some odd errors complaining math. However, it doesn't seem that it is needed to be included. Remove the include of math.h. Just removing it from db_bench doesn't seem to break anything. Replacing sqrt from std::sqrt seems to work for histogram.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6373

Test Plan: Watch Travis and appveyor to run.

Differential Revision: D19730068

fbshipit-source-id: d3ad41defcdd9f51c2da1a3673fb258f5dfacf47
2020-02-05 21:00:49 -08:00
Levi Tamasi
130e710056 Add BlobDB GC cutoff parameter to db_bench (#6211)
Summary:
The patch makes it possible to set the BlobDB configuration option
`garbage_collection_cutoff` on the command line. In addition, it changes
the `db_bench` code so that the default values of BlobDB related
parameters are taken from the defaults of the actual BlobDB
configuration options (note: this changes the the default of
`blob_db_bytes_per_sync`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6211

Test Plan: Ran `db_bench` with various values of the new parameter.

Differential Revision: D19166895

Pulled By: ltamasi

fbshipit-source-id: 305ccdf0123b9db032b744715810babdc3e3b7d5
2019-12-18 17:46:08 -08:00
Zhichao Cao
8ea087ad16 Workload generator (Mixgraph) based on prefix hotness (#5953)
Summary:
In the previous PR https://github.com/facebook/rocksdb/issues/4788, user can use db_bench mix_graph option to generate the workload that is from the social graph. The key is generated based on the key access hotness. In this PR, user can further model the key-range hotness and fit those to two-term-exponential distribution. First, user cuts the whole key space into small key ranges (e.g., key-ranges are the same size and the key-range number is the number of SST files). Then, user calculates the average access count per key of each key-range as the key-range hotness. Next, user fits the key-range hotness to two-term-exponential distribution (f(x) = f(x) = a*exp(b*x) + c*exp(d*x)) and generate the value of a, b, c, and d. They are the parameters in db_bench: prefix_dist_a, prefix_dist_b, prefix_dist_c, and prefix_dist_d. Finally, user can run db_bench by specify the parameters.
For example:
`./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953

Test Plan: run db_bench with different parameters and checked the results.

Differential Revision: D18053527

Pulled By: zhichao-cao

fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7
2019-11-06 13:02:20 -08:00
Yanqin Jin
c0abc6bbc1 Use FLAGS_env for certain operations in db_bench (#5943)
Summary:
Since we already parse env_uri from command line and creates custom Env
accordingly, we should invoke the methods of such Envs instead of using
Env::Default().

Test Plan (on devserver):
```
$make db_bench db_stress
$./db_bench -benchmarks=fillseq
./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5943

Differential Revision: D18018550

Pulled By: riversand963

fbshipit-source-id: 03b61329aaae0dfd914a0b902cc677f570f102e3
2019-10-22 11:43:21 -07:00
Zhichao Cao
526e3b9763 Enable trace_replay with multi-threads (#5934)
Summary:
In the current trace replay, all the queries are serialized and called by single threads. It may not simulate the original application query situations closely. The multi-threads replay is implemented in this PR. Users can set the number of threads to replay the trace. The queries generated according to the trace records are scheduled in the thread pool job queue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934

Test Plan: test with make check and real trace replay.

Differential Revision: D17998098

Pulled By: zhichao-cao

fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8
2019-10-18 14:13:50 -07:00
Levi Tamasi
78b28d80b0 Support non-TTL Puts for BlobDB in db_bench (#5921)
Summary:
Currently, db_bench only supports PutWithTTL operations for BlobDB but
not regular Puts. The patch adds support for regular (non-TTL) Puts and also
changes the default for blob_db_max_ttl_range to zero, which corresponds
to no TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5921

Test Plan:
make check

./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 (issues Put operations with no TTL)

./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 -blob_db_max_ttl_range=86400 (issues
PutWithTTL operations with random TTLs in the [0, blob_db_max_ttl_range)
interval, as before)

Differential Revision: D17919798

Pulled By: ltamasi

fbshipit-source-id: b946c3522b836b92b4c157ffbad24f92ba2b0a16
2019-10-14 17:49:20 -07:00
sdong
e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
Lingjing You
1a928c22a0 Add insert hints for each writebatch (#5728)
Summary:
Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it.

Bench result (qps):

`./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`

master:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 387883  | 220790  | 308294  | 490998  |
| 10                      | 1397208 | 978911  | 1275684 | 1733395 |
| 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
| 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |

fillseq with writebatch hint:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 286005  | 223570  | 300024  | 466981  |
| 10                      | 970374  | 813308  | 1399299 | 1753588 |
| 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
| 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728

Differential Revision: D17297240

fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
2019-09-12 17:15:18 -07:00
anand76
eb9026f09b Add a db_bench benchmark to warm up the row cache
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5707

Differential Revision: D17242698

Pulled By: anand1976

fbshipit-source-id: 5d1bfda3c9e8f56176ae391cae6c91e6262016b8
2019-09-10 11:06:36 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Mark Rambacher
cfcf045acc The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293)
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects.  Objects are registered with the registry by Type (the class must implement the static const char *Type() method).

This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered.  This could be useful if, for example, one Option allowed for a dynamic library and one did not.

When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.

Test plan (on riversand963's  devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293

Differential Revision: D16363396

Pulled By: riversand963

fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
2019-07-23 17:13:05 -07:00
sdong
e4dcf5fd22 db_bench to add a new "benchmark" to print out all stats history (#5532)
Summary:
Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532

Test Plan: Run the benchmark manually and observe the output is as expected.

Differential Revision: D16097764

fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904
2019-07-03 20:03:28 -07:00
haoyuhuang
66464d1fde Remove multiple declarations o kMicrosInSecond.
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526

Test Plan:
OPT=-g V=1 make J=1 unity_test -j32
make clean && make -j32

Differential Revision: D16079315

Pulled By: HaoyuHuang

fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6
2019-07-01 15:15:12 -07:00
Eli Pozniansky
3e6c185381 Formatting fixes in db_bench_tool (#5525)
Summary:
Formatting fixes in db_bench_tool that were accidentally omitted
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5525

Test Plan: Unit tests

Differential Revision: D16078516

Pulled By: elipoz

fbshipit-source-id: bf8df0e3f08092a91794ebf285396d9b8a335bb9
2019-07-01 14:57:28 -07:00
Eli Pozniansky
f872009237 Fix from some C-style casting (#5524)
Summary:
Fix from some C-style casting in bloom.cc and ./tools/db_bench_tool.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5524

Differential Revision: D16075626

Pulled By: elipoz

fbshipit-source-id: 352948885efb64a7ef865942c75c3c727a914207
2019-07-01 13:05:34 -07:00
Zhongyi Xie
671d15cbdd Persistent Stats: persist stats history to disk (#5046)
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and  both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046

Differential Revision: D15863138

Pulled By: miasantreble

fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
2019-06-17 15:21:50 -07:00
haoyuhuang
d43b4cd570 Integrate block cache tracing into db_bench (#5459)
Summary:
This PR integrates the block cache tracing into db_bench. It adds three command line arguments.
-block_cache_trace_file (Block cache trace file path.) type: string default: ""
-block_cache_trace_max_trace_file_size_in_bytes (The maximum block cache
trace file size in bytes. Block cache accesses will not be logged if the
trace file size exceeds this threshold. Default is 64 GB.) type: int64
default: 68719476736
-block_cache_trace_sampling_frequency (Block cache trace sampling
frequency, termed s. It uses spatial downsampling and samples accesses to
one out of s blocks.) type: int32 default: 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5459

Differential Revision: D15832031

Pulled By: HaoyuHuang

fbshipit-source-id: 0ecf2f2686557251fe741a2769b21170777efa3d
2019-06-17 11:08:21 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Yanqin Jin
83f7a8eed0 Fix compilation error in LITE mode (#5391)
Summary:
Add macro ROCKSDB_LITE to fix compilation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5391

Differential Revision: D15574522

Pulled By: riversand963

fbshipit-source-id: 95aea83c5d9b2bf98a3ba0ef9167b63c9be2988b
2019-05-31 08:32:22 -07:00
Yanqin Jin
b9f5900658 Fix WAL replay by skipping old write batches (#5170)
Summary:
1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
2. Add support for benchmarking secondary instance to db_bench_tool.
With changes made in this PR, we can start benchmarking secondary instance
using two processes. It is also possible to vary the frequency at which the
secondary instance tries to catch up with the primary. The info log of the
secondary can be found in a directory whose path can be specified with
'-secondary_path'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170

Differential Revision: D15564608

Pulled By: riversand963

fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
2019-05-30 19:33:33 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Maysam Yabandeh
eab4f49a2c WritePrepared: skip_concurrency_control option (#5330)
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330

Differential Revision: D15525932

Pulled By: maysamyabandeh

fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
2019-05-28 16:29:45 -07:00
Zhichao Cao
a13026fb2f Added trace replay fast forward function (#5273)
Summary:
In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273

Differential Revision: D15389232

Pulled By: zhichao-cao

fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806
2019-05-16 20:21:18 -07:00
Maysam Yabandeh
f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
Yi Wu
92c60547fe db_bench: fix hang on IO error (#5300)
Summary:
db_bench will wait indefinitely if there's background error. Fix by pass `abs_time_us` to cond var.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300

Differential Revision: D15319945

Pulled By: miasantreble

fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44
2019-05-13 11:30:35 -07:00