rocksdb/include/rocksdb
Hui Xiao ca0ef54f16 Rate-limit automatic WAL flush after each user write (#9607)
Summary:
**Context:**
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR adds rate limiting for automatic WAL flush, i.e., the flush that automatically happens after each user write operation (`Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.

Note that we are NOT rate-limiting WAL flushes that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true` plus a manual `FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of:
- being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
- being able to turn off rate-limiting for some WAL flushes but not all (e.g., turn it off specifically for the WAL flush of a critical user write such as a service's heartbeat)

`WriteOptions::rate_limiter_options` currently only accepts `Env::IO_USER` and `Env::IO_TOTAL` due to an implementation constraint.
- The constraint is that we currently queue parallel writes (including WAL writes) under a FIFO policy, which does not factor rate limiter priority into this layer's scheduling. If we allowed lower priorities such as `Env::IO_HIGH/MID/LOW`, and writes with lower priorities arrived before writes with higher priorities (even by a tiny margin in arrival time), the former would block the latter, causing a "priority inversion" that contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` for now, until that scheduling is improved.
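
The priority-inversion concern above can be illustrated with a toy model. This is only a sketch, not RocksDB code: the function name `fifo_finish_times`, the write names, and the wait times are hypothetical, and the model simply assumes each queued write must finish before the next one starts.

```shell
#!/bin/bash
# Toy model (not RocksDB code): each input line is "name priority wait_seconds",
# where wait_seconds is a made-up time the write spends waiting on the rate limiter.
# Writes complete strictly in arrival order, so each finish time is a running sum.
fifo_finish_times() {
  awk '{ t += $3; printf "%s(%s) finishes at t=%d\n", $1, $2, t }'
}

# A heavily throttled low-priority write arrives just before a user-priority write.
printf '%s\n' 'write_A IO_LOW 10' 'write_B IO_USER 1' | fifo_finish_times
```

In this model the `IO_USER` write finishes only after the throttled `IO_LOW` write ahead of it, which is the behavior the restriction to `Env::IO_USER` and `Env::IO_TOTAL` avoids.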

A prerequisite for this feature is support for operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

**Summary:**
- Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` to make room for a new test suite
- Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
- Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607

Test Plan:
- Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on reads only (e.g., `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
- Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
- `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
   - verified that WAL flushes (i.e., _write_ system calls) were chunked into 15 bytes and consecutive _write_ calls were roughly 1 second apart
   - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
- crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal
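
The chunking seen in the `strace` check above is consistent with simple arithmetic: at 15 bytes per second with a 1,000,000 us refill period, each refill grants 15 bytes, so a larger write has to be split across several refills. A minimal sketch of that arithmetic follows; the helper names and the 100-byte payload are hypothetical, and the real limiter's internals are not modeled.

```shell
#!/bin/bash
# Hypothetical helper: bytes granted per refill = rate (bytes/s) * refill period (us) / 1e6.
bytes_per_refill() {
  echo $(( $1 * $2 / 1000000 ))
}

# Hypothetical helper: number of chunks for a write of $1 bytes with chunk size $2
# (ceiling division).
num_chunks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

chunk=$(bytes_per_refill 15 1000000)
echo "chunk size: ${chunk} bytes"
echo "chunks for a 100-byte write: $(num_chunks 100 "$chunk")"
```

With one refill per second, the 7 chunks in this example would be spread over roughly 7 seconds.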

**Benchmarked on flush/compaction to ensure no performance regression:**
- compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: **908316 micros/op**; post-change: **907350 micros/op (improved by 0.106%)**
```
#!/bin/bash
# Average compaction time (micros/op) over a doubling number of runs: 10, 20, ..., 1280.
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(seq "$START" "$NUM_DATA_ENTRY")
do
    NUM_RUN=$((N * (2 ** (i - 1))))
    for j in $(seq "$START" "$NUM_RUN")
    do
       # Fill the DB with auto compaction disabled, then time one manual compaction.
       ./db_bench --benchmarks=fillrandom -db="$TEST_TMPDIR" -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db="$TEST_TMPDIR" -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | grep -E 'compact'
    # Append mean and standard deviation of column 3 (micros/op) for this round.
    done > compact_bmk_output.txt && awk -v NUM_RUN="$NUM_RUN" '{sum+=$3;sum_sq+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sq/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
- compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
- flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench): pre-change: **745752 micros/op**; post-change: **745331 micros/op (improved by 0.06%)**
```
#!/bin/bash
# Average flush time (micros/op) over a doubling number of runs.
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f flush_bmk_output.txt flush_bmk_output_2.txt

for i in $(seq "$START" "$NUM_DATA_ENTRY")
do
    NUM_RUN=$((N * (2 ** (i - 1))))
    for j in $(seq "$START" "$NUM_RUN")
    do
       # Fill the memtable sequentially, then time the flush.
       ./db_bench -db="$TEST_TMPDIR" -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | grep -E 'flush'
    # Append mean and standard deviation of column 3 (micros/op) for this round.
    done > flush_bmk_output.txt && awk -v NUM_RUN="$NUM_RUN" '{sum+=$3;sum_sq+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sq/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done

```
- flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench): pre-change: **487512 micros/op**; post-change: **485856 micros/op (improved by 0.34%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
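
The benchmark scripts above double the number of runs each round and summarize every round with a mean and a population standard deviation computed as sqrt(E[x^2] - E[x]^2). The sketch below reproduces both steps in isolation. The helper names are hypothetical and the input values are synthetic, chosen only so the result is easy to check by hand; they are not benchmark data.

```shell
#!/bin/bash
# Hypothetical helper: run counts N * 2^(i-1) for i = 1..rounds, computed by doubling.
run_counts() {
  runs=$1; rounds=$2; out=""; i=1
  while [ "$i" -le "$rounds" ]; do
    out="${out}${out:+ }${runs}"
    runs=$(( runs * 2 ))
    i=$(( i + 1 ))
  done
  echo "$out"
}

# Hypothetical helper: mean and population standard deviation of column 3,
# the column holding micros/op in the filtered db_bench output.
mean_std() {
  awk '{ sum += $3; sum_sq += $3 ^ 2 }
       END { mean = sum / NR; printf "%.2f %.2f\n", mean, sqrt(sum_sq / NR - mean ^ 2) }'
}

run_counts 10 8
# Synthetic samples 2 4 4 4 5 5 7 9: mean 5, population std 2.
printf 'compact : %s\n' 2 4 4 4 5 5 7 9 | mean_std
```

The first line of output lists the run counts used per round (10 up to 1280); the second shows the mean and standard deviation for the synthetic samples.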

table 1 - compact with rate-limiting

#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
320 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
640 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
**1280** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682

table 2 - compact w/o rate-limiting

#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
320 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
**640** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 | -0.25

table 3 - flush with rate-limiting

#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
80 | 742129 | 8148.51 | 743417 | 9631.58 | 0.1735547324
160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
**320** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
640 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238

table 4 - flush w/o rate-limiting

#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
**320** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
640 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00
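
The last column of each table is the relative change (post - pre) / pre * 100, so a negative value means the post-change build spent fewer micros/op. A small sketch that recomputes two of the highlighted rows; the helper name `pct_change` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: percentage change in avg micros/op from pre-change ($1)
# to post-change ($2); negative means the post-change build was faster.
pct_change() {
  awk -v pre="$1" -v post="$2" 'BEGIN { printf "%.4f\n", (post - pre) / pre * 100 }'
}

pct_change 908316 907350   # table 1, 1280-run row
pct_change 822197 823148   # table 2, 640-run row
```

Rounded to four decimals, these give -0.1064 and 0.1157, in line with the values listed in tables 1 and 2.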

Reviewed By: ajkr

Differential Revision: D34442441

Pulled By: hx235

fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
2022-03-08 13:19:39 -08:00
utilities Add support for BlobDB to ldb (#9630) 2022-02-25 23:13:11 -08:00
advanced_options.h compression_per_level should be used for flush and changeable (#9658) 2022-03-07 18:06:19 -08:00
c.h Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) 2022-03-01 13:58:02 -08:00
cache_bench_tool.h Allow cache_bench/db_bench to use a custom secondary cache (#8312) 2021-05-19 15:26:18 -07:00
cache.h Add a secondary cache implementation based on LRUCache 1 (#9518) 2022-02-23 16:06:27 -08:00
cleanable.h Replace most typedef with using= (#8751) 2021-09-07 11:31:59 -07:00
compaction_filter.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
compaction_job_stats.h Update compaction statistics to include the amount of data read from blob files (#8022) 2021-03-04 00:43:48 -08:00
comparator.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
compression_type.h Move CompressionType to its own header file (#7162) 2020-08-03 15:49:31 -07:00
concurrent_task_limiter.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
configurable.h Improve performance of SliceTransform::AsString (#9401) 2022-01-27 10:05:33 -08:00
convenience.h Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452) 2022-01-27 13:01:09 -08:00
customizable.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
data_structure.h Add (Live)FileStorageInfo API (#8968) 2021-10-16 10:04:32 -07:00
db_bench_tool.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
db_dump_tool.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
db_stress_tool.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
db.h Change enum SizeApproximationFlags to enum class (#9604) 2022-02-18 20:22:57 -08:00
env_encryption.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
env.h Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
experimental.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
file_checksum.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
file_system.h Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
filter_policy.h Make FilterPolicy Customizable (#9590) 2022-02-18 13:22:31 -08:00
flush_block_policy.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
functor_wrapper.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
io_status.h Combine data members of IOStatus with Status (#9549) 2022-02-22 11:23:01 -08:00
iostats_context.h Add file temperature related counter and bytes stats to and io_stats (#8710) 2021-10-07 14:58:41 -07:00
iterator.h Add API warning for Iterator::Refresh() with range tombstones (#9398) 2022-01-19 10:13:27 -08:00
ldb_tool.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
listener.h Add temperature information to the event listener callbacks (#9591) 2022-02-18 11:23:18 -08:00
memory_allocator.h Make MemoryAllocator into a Customizable class (#8980) 2021-12-17 04:20:47 -08:00
memtablerep.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
merge_operator.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
metadata.h Add (Live)FileStorageInfo API (#8968) 2021-10-16 10:04:32 -07:00
options.h Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
perf_context.h Add a PerfContext counter for secondary cache hits (#8685) 2021-08-20 15:17:30 -07:00
perf_level.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
persistent_cache.h Check for and disallow shared key space in block caches (#9172) 2021-11-16 11:16:05 -08:00
rate_limiter.h Remove deprecated option new_table_reader_for_compaction_inputs (#9443) 2022-02-08 19:31:28 -08:00
rocksdb_namespace.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
secondary_cache.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
slice_transform.h Some better API and other comments (#9533) 2022-02-17 18:51:08 -08:00
slice.h Require C++17 (#9481) 2022-02-04 17:13:10 -08:00
snapshot.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
sst_dump_tool.h Add --version and --help to ldb and sst_dump (#6951) 2020-06-09 10:04:01 -07:00
sst_file_manager.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
sst_file_reader.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
sst_file_writer.h Support timestamps in SstFileWriter (#8899) 2021-09-09 18:58:01 -07:00
sst_partitioner.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
statistics.h Add last level and non-last level read statistics (#9519) 2022-02-18 14:23:07 -08:00
stats_history.h More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
status.h Combine data members of IOStatus with Status (#9549) 2022-02-22 11:23:01 -08:00
system_clock.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
table_properties.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
table.h Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654) 2022-03-04 10:35:08 -08:00
thread_status.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
threadpool.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
trace_reader_writer.h Update comments, fix typos. (#8721) 2021-08-27 13:16:32 -07:00
trace_record_result.h Add IteratorTraceExecutionResult for iterator related trace records. (#8687) 2021-08-20 15:35:56 -07:00
trace_record.h Add IteratorTraceExecutionResult for iterator related trace records. (#8687) 2021-08-20 15:35:56 -07:00
transaction_log.h Replace most typedef with using= (#8751) 2021-09-07 11:31:59 -07:00
types.h Expose blob file information through the EventListener interface (#8675) 2021-09-16 17:23:36 -07:00
unique_id.h Experimental support for SST unique IDs (#8990) 2021-10-18 23:32:01 -07:00
universal_compaction.h Incremental Space Amp Compactions in Universal Style (#8655) 2021-10-20 10:04:13 -07:00
version.h Update HISTORY.md and version.h for 7.0 release (#9609) 2022-02-20 15:22:54 -08:00
wal_filter.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
write_batch_base.h Revise APIs related to user-defined timestamp (#8946) 2022-02-01 22:19:01 -08:00
write_batch.h Support WBWI for keys having timestamps (#9603) 2022-02-22 14:23:01 -08:00
write_buffer_manager.h Minor improvement to CacheReservationManager/WriteBufferManager/CompressionDictBuilding (#9139) 2021-11-05 16:13:47 -07:00