2310 Commits

Author SHA1 Message Date
Alan Paxton
b6ad0d958f Fb 9718 verify checksums is ignored (#9767)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9718

The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call.

This may result in much better performance in a few workloads where the client chooses to remove verification.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767

Reviewed By: mrambacher

Differential Revision: D35218986

Pulled By: jay-zhuang

fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8
2022-03-29 11:54:54 -07:00
Hui Xiao
e2cb9aa27c Clarify Options::rate_limiter api doc for #9607 Rate-limit automatic WAL flush after each user write (#9745)
Summary:
As title for https://github.com/facebook/rocksdb/pull/9607

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9745

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D35096901

Pulled By: hx235

fbshipit-source-id: 6bd3671baecfdc04579b0a81a957bfaa7bed81e1
2022-03-25 15:16:07 -07:00
Peter Dillinger
91687d70ea Fix a major performance bug in 7.0 re: filter compatibility (#9736)
Summary:
Bloom filters generated by pre-7.0 releases are not read by
7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on
upgrade or downgrade with existing DB, but not data correctness.

To fix, we go back using the old, unified name in SST metadata but (for
a while anyway) recognize the aliases that could be generated by early
7.0.x releases. This unfortunately requires a public API change to avoid
interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change
only affects users with custom FilterPolicy, which should be very few.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736

Test Plan:
manual

Generate DBs with
```
./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
```
and similar. Compare with
```
for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
```

Results:
```
Testing 6.29 on 6.29:
readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
Testing 6.29 on 7.0:
readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
Testing 6.29 on fixed:
readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
Testing 7.0 on 6.29:
readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
Testing 7.0 on 7.0:
readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
Testing 7.0 on fixed:
readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
Testing fixed on 6.29:
readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
Testing fixed on 7.0:
readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
Testing fixed on fixed:
readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
```

Just paying attention to order of magnitude of ops/sec (short test
durations, lots of noise), it's clear that with the fix we can read <= 6.29
& >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29
release can properly read fixed DB at full speed.

Reviewed By: siying, ajkr

Differential Revision: D35057844

Pulled By: pdillinger

fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
2022-03-23 10:00:54 -07:00
gitbw95
8102690a52 Update Cache::Release param from force_erase to erase_if_last_ref (#9728)
Summary:
The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728

Reviewed By: pdillinger

Differential Revision: D35038673

Pulled By: gitbw95

fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f
2022-03-22 10:22:18 -07:00
Hui Xiao
b360d25deb Update HISTORY.md and version.h for 7.1 release (#9727)
Summary:
As title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9727

Test Plan: no code change

Reviewed By: ajkr

Differential Revision: D35034541

Pulled By: hx235

fbshipit-source-id: ae839f23db1bdb9e5f787ca653a7685beb2ada68
2022-03-21 19:48:41 -07:00
duyuqi
cbe303c19b fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471)
Summary:
c api release snapshot will core dump when enable inplace_update_support and create snapshot

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9471

Reviewed By: akankshamahajan15

Differential Revision: D34965103

Pulled By: riversand963

fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77
2022-03-21 12:04:33 -07:00
Akanksha Mahajan
49a10feb21 Provide implementation to prefetch data asynchronously in FilePrefetchBuffer (#9674)
Summary:
In FilePrefetchBuffer if reads are sequential, after prefetching call ReadAsync API to prefetch data asynchronously so that in next prefetching data will be available. Data prefetched asynchronously will be readahead_size/2. It uses two buffers, one for synchronous prefetching and one for asynchronous. In case, the data is overlapping, the data is copied from both buffers to third buffer to make it continuous.
This feature is under ReadOptions::async_io and is under experimental.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9674

Test Plan:
1. Add new unit tests
2. Run **db_stress** to make sure nothing crashes.

    -   Normal prefetch without `async_io` ran successfully:
```
export CRASH_TEST_EXT_ARGS=" --async_io=0"
 make crash_test -j
 ```

3. **Run Regressions**.
   i) Main branch without any change for normal prefetching with async_io disabled:

 ```
 ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -
           use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
 ```

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

  ii) normal prefetching after changes with async_io disable:

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 14:11:31 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_withchange]
seekrandom   :  471347.227 micros/op 2 ops/sec;  348.1 MB/s (255 of 255 found)
```

Reviewed By: anand1976

Differential Revision: D34731543

Pulled By: akankshamahajan15

fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f
2022-03-21 07:12:43 -07:00
Peter Dillinger
a8a422e962 Add manifest fix-up utility for file temperatures (#9683)
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.

This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).

Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)

Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683

Test Plan: unit tests added

Reviewed By: jay-zhuang

Differential Revision: D34798828

Pulled By: pdillinger

fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
2022-03-18 16:35:51 -07:00
Peter Dillinger
cff0d1e8e6 New backup meta schema, with file temperatures (#9660)
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".

To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.

Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660

Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.

Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.

Reviewed By: ajkr

Differential Revision: D34686968

Pulled By: pdillinger

fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
2022-03-18 11:06:17 -07:00
Yanqin Jin
5894761056 Improve stress test for transactions (#9568)
Summary:
Test only, no change to functionality.
Extremely low risk of library regression.

Update test key generation by maintaining existing and non-existing keys.
Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared.
Add a make target 'blackbox_crash_test_with_multiops_txn'.

Running the following commands caught the bug exposed in https://github.com/facebook/rocksdb/issues/9571.
```
$rm -rf /tmp/rocksdbtest/*
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \
   -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \
   -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0
   -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \
   -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \
   -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
```

Running the following command caught a bug which will be fixed in https://github.com/facebook/rocksdb/issues/9648 .
```
$TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9568

Reviewed By: jay-zhuang

Differential Revision: D34308154

Pulled By: riversand963

fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7
2022-03-16 19:00:04 -07:00
slk
95305c44a1 Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary:
As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API  named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery.

And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410

Reviewed By: ltamasi

Differential Revision: D34410207

Pulled By: riversand963

fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
2022-03-11 16:13:23 -08:00
Akanksha Mahajan
8465cccde2 Posix API support for Async Read and Poll APIs (#9578)
Summary:
Provide support for Async Read and Poll in Posix file system using IOUring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578

Test Plan: In progress

Reviewed By: anand1976

Differential Revision: D34690256

Pulled By: akankshamahajan15

fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
2022-03-10 18:28:31 -08:00
Hui Xiao
ca0ef54f16 Rate-limit automatic WAL flush after each user write (#9607)
Summary:
**Context:**
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.

Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as  `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes),  for the benefits of:
- being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
- being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat)

`WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint.
- The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL`  right now before improving that scheduling.

A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

**Summary:**
- Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite
- Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
- Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607

Test Plan:
- Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
- Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
- `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
   - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart
   - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
- crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal

**Benchmarked on flush/compaction to ensure no performance regression:**
- compaction with rate-limiting  (see table 1, avg over 1280-run):  pre-change: **915635 micros/op**; post-change:
   **907350 micros/op (improved by 0.106%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
    done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
- compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
- flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench ): pre-change: **745752 micros/op**; post-change: **745331 micros/op (regressed by 0.06 %)**
```
 #!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f flush_bmk_output.txt flush_bmk_output_2.txt

for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
    done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done

```
- flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench): pre-change: **487512 micros/op**, post-change: **485856 micors/ops (improved by 0.34%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```

| table 1 - compact with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
3.20E+02 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
6.40E+02 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
**1.28E+03** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682

| table 2 - compact w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
3.20E+02 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
**6.40E+02** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 |  -0.25

| table 3 - flush with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
80 | 742129 | 8148.51 | 743417 | 9631.58| 0.1735547324
160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
**3.20E+02** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
6.40E+02 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238

| table 4 - flush w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
**3.20E+02** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
6.40E+02 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00

Reviewed By: ajkr

Differential Revision: D34442441

Pulled By: hx235

fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
2022-03-08 13:19:39 -08:00
Jay Zhuang
36aec94d85 compression_per_level should be used for flush and changeable (#9658)
Summary:
- Make `compression_per_level` dynamical changeable with `SetOptions`;
- Fix a bug that `compression_per_level` is not used for flush;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34700749

Pulled By: jay-zhuang

fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
2022-03-07 18:06:19 -08:00
Hui Xiao
4a776d81cc Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654)
Summary:
**Context/Summary:**
As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654

Test Plan: - New unit test

Reviewed By: pdillinger

Differential Revision: D34622609

Pulled By: hx235

fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9
2022-03-04 10:35:08 -08:00
Akanksha Mahajan
d74468e348 Update Poll and ReadAsync API in File System (#9623)
Summary:
Update the signature of Poll and ReadAsync APIs in filesystem.
Instead of unique_ptr, void** will be passed as io_handle and the delete function.
io_handle and delete function should be provided by underlying
FileSystem and its lifetime will be maintained by RocksDB. io_handle
will be deleted by RocksDB once callback is made to update the results or Poll is
called to get the results.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9623

Test Plan: Add a new unit test.

Reviewed By: anand1976

Differential Revision: D34403529

Pulled By: akankshamahajan15

fbshipit-source-id: ea185a5f4c7bec334631e4f781ea7ba4135645f0
2022-03-01 17:11:42 -08:00
sdong
33742c2a9f Remove BlockBasedTableOptions.hash_index_allow_collision (#9454)
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D33805827

fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
2022-03-01 13:58:02 -08:00
Changneng Chen
9ed96703d1 Add support for BlobDB to ldb (#9630)
Summary:
Add the configuration options and help messages of BlobDB to `ldb`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9630

Test Plan: `python ./tools/ldb_test.py`

Reviewed By: ltamasi

Differential Revision: D34443176

Pulled By: changneng

fbshipit-source-id: 5b3f185cdfc2561e06dd37215c7edfbca07dbe80
2022-02-25 23:13:11 -08:00
Bo Wang
f706a9c199 Add a secondary cache implementation based on LRUCache 1 (#9518)
Summary:
**Summary:**
RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.

This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518

Test Plan:
In this PR, the lru_secondary_cache_test.cc includes the following tests:
1. The unit tests for the secondary cache with either compression or no compression, such as basic tests, fails tests.
2. The integration tests with both primary cache and this secondary cache .

**Follow Up:**

1. Statistics (e.g. compression ratio) will be added in another PR.
2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.

Reviewed By: anand1976

Differential Revision: D34430930

Pulled By: gitbw95

fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
2022-02-23 16:06:27 -08:00
Yanqin Jin
6f12599863 Support WBWI for keys having timestamps (#9603)
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**

Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.

When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D34354849

Pulled By: riversand963

fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
2022-02-22 14:23:01 -08:00
anand76
d795a730be Combine data members of IOStatus with Status (#9549)
Summary:
Combine the data members retryable_, data_loss_ and scope_ of IOStatus
with Status, as protected members. IOStatus is now defined as a derived class of Status with
no new data, but additional methods. This will allow us to eventually
track the result of FileSystem calls in RocksDB with one variable
instead of two.

Benchmark commands and results are below. The performance after changes seems slightly better.

```./db_bench -db=/data/mysql/rocksdb/prefix_scan -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216```

```./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,seekrandom,readseq" -key_size=32 -value_size=512 -num=5000000 -seek_nexts=10000 -use_direct_reads=true -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=false -threads=1 -cache_size=10485760000```

Before -
seekrandom   :    3715.432 micros/op 269 ops/sec; 1394.9 MB/s (16149 of 16149 found)
seekrandom   :    3687.177 micros/op 271 ops/sec; 1405.6 MB/s (16273 of 16273 found)
seekrandom   :    3709.646 micros/op 269 ops/sec; 1397.1 MB/s (16175 of 16175 found)

readseq      :       0.369 micros/op 2711321 ops/sec; 1406.6 MB/s
readseq      :       0.363 micros/op 2754092 ops/sec; 1428.8 MB/s
readseq      :       0.372 micros/op 2688046 ops/sec; 1394.6 MB/s

After -
seekrandom   :    3606.830 micros/op 277 ops/sec; 1436.9 MB/s (16636 of 16636 found)
seekrandom   :    3594.467 micros/op 278 ops/sec; 1441.9 MB/s (16693 of 16693 found)
seekrandom   :    3597.919 micros/op 277 ops/sec; 1440.5 MB/s (16677 of 16677 found)

readseq      :       0.354 micros/op 2822809 ops/sec; 1464.5 MB/s
readseq      :       0.358 micros/op 2795080 ops/sec; 1450.1 MB/s
readseq      :       0.354 micros/op 2822889 ops/sec; 1464.5 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9549

Reviewed By: pdillinger

Differential Revision: D34310362

Pulled By: anand1976

fbshipit-source-id: 54b27756edf9c9ecfe730a2dce542a7a46743096
2022-02-22 11:23:01 -08:00
Andrew Kryczka
7ae4da924a Update HISTORY.md and version.h for 7.0 release (#9609)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9609

Reviewed By: riversand963

Differential Revision: D34370309

Pulled By: ajkr

fbshipit-source-id: 5fc9306439aefa4b2d61d847534ea6758c30b6a5
2022-02-20 15:22:54 -08:00
Akanksha Mahajan
3699b171e4 Change enum SizeApproximationFlags to enum class (#9604)
Summary:
Change enum SizeApproximationFlags to enum and class and add
overloaded operators for the transition between enum class and uint8_t

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9604

Test Plan: Circle CI jobs

Reviewed By: riversand963

Differential Revision: D34360281

Pulled By: akankshamahajan15

fbshipit-source-id: 6351dfdb717ae3c4530d324c3d37a8ecb01dd1ef
2022-02-18 20:22:57 -08:00
Akanksha Mahajan
559525dcbb Add Async Read and Poll APIs in FileSystem (#9564)
Summary:
This PR adds support for new APIs Async Read that reads the data
asynchronously and Poll API that checks if requested read request has
completed or not.

Usage: In RocksDB, we are currently planning to prefetch data
asynchronously during sequential scanning and RocksDB will call these
APIs to prefetch more data in advanced.

Design:
- ReadAsync API submits the read request to underlying FileSystem in
order to read data asynchronously. When read request is completed,
callback function will be called. cb_arg is used by RocksDB to track the
original request submitted and IOHandle is used by FileSystem to keep track
of IO requests at their level.

- The Poll API  is added in FileSystem because the call could end up handling
completions for multiple different files which is not specific to a
FSRandomAccessFile instance. There could be multiple outstanding file reads
from different files in future and they can complete in any order.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9564

Test Plan: Test will be added in separate PR.

Reviewed By: anand1976

Differential Revision: D34226216

Pulled By: akankshamahajan15

fbshipit-source-id: 95e64edafb17f543f7232421d51e2665a3267f69
2022-02-18 17:23:18 -08:00
Jay Zhuang
f4b2500e12 Add last level and non-last level read statistics (#9519)
Summary:
Add last level and non-last level read statistics:
```
LAST_LEVEL_READ_BYTES,
LAST_LEVEL_READ_COUNT,
NON_LAST_LEVEL_READ_BYTES,
NON_LAST_LEVEL_READ_COUNT,
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D34062539

Pulled By: jay-zhuang

fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd
2022-02-18 14:23:07 -08:00
mrambacher
30b08878d8 Make FilterPolicy Customizable (#9590)
Summary:
Make FilterPolicy into a Customizable class.  Allow new FilterPolicy to be discovered through the ObjectRegistry

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590

Reviewed By: pdillinger

Differential Revision: D34327367

Pulled By: mrambacher

fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
2022-02-18 13:22:31 -08:00
Jay Zhuang
2fbc672732 Add temperature information to the event listener callbacks (#9591)
Summary:
RocksDB try to provide temperature information in the event
listener callbacks. The information is not guaranteed, as some operation
like backup won't have these information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591

Test Plan: Added unittest

Reviewed By: siying, pdillinger

Differential Revision: D34309339

Pulled By: jay-zhuang

fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
2022-02-18 11:23:18 -08:00
Andrew Kryczka
54fb2a8975 Change type of cache buffer passed to Cache::CreateCallback() to const void* (#9595)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9595

Reviewed By: riversand963

Differential Revision: D34329906

Pulled By: ajkr

fbshipit-source-id: 508601856fa9bee4d40f4a68d14d333ef2143d40
2022-02-17 21:09:56 -08:00
Peter Dillinger
48b9de4a3e Mark more OldDefaults as deprecated (#9594)
Summary:
`ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults`
now deprecated. Were previously overlooked with `Options::OldDefaults` in https://github.com/facebook/rocksdb/issues/9363

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9594

Test Plan: comments only

Reviewed By: jay-zhuang

Differential Revision: D34318592

Pulled By: pdillinger

fbshipit-source-id: 773c97a61e2a8290ae154f363dd61c1f35a9dd16
2022-02-17 20:28:10 -08:00
Peter Dillinger
561be005ba Some better API and other comments (#9533)
Summary:
Various comments, mostly about SliceTransform + prefix extractors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9533

Test Plan: comments only

Reviewed By: ajkr

Differential Revision: D34094367

Pulled By: pdillinger

fbshipit-source-id: 9742ce3b89ef7fd5c5e748fec862e6361ed44e95
2022-02-17 18:51:08 -08:00
Peter Dillinger
725833a424 Hide FilterBits{Builder,Reader} from public API (#9592)
Summary:
We don't have any evidence of people using these to build custom
filters. The recommended way of customizing filter handling is to
defer to various built-in policies based on FilterBuildingContext
(e.g. to build Monkey filtering policy). With old API, we have
evidence of people modifying keys going into filter, but most cases
of that can be handled with prefix_extractor.

Having FilterBitsBuilder+Reader in the public API is an ogoing
hinderance to code evolution (e.g. recent new Finish and
MaybePostVerify), and so this change removes them from the public API
for 7.0. Maybe they will come back in some form later, but lacking
evidence of them providing value in the public API, we want to take back
more freedom to evolve these.

With this moved to internal-only, there is no rush to clean up the
complex Finish signatures, or add memory allocator support, but doing so
is much easier with them out of public API, for example to use
CacheAllocationPtr without exposing it in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9592

Test Plan: cosmetic changes only

Reviewed By: hx235

Differential Revision: D34315470

Pulled By: pdillinger

fbshipit-source-id: 03e03bb66a72c73df2c464d2dbbbae906dd8f99b
2022-02-17 16:34:46 -08:00
Siddhartha Roychowdhury
39b0d92153 Add record to set WAL compression type if enabled (#9556)
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556

Reviewed By: anand1976

Differential Revision: D34308216

Pulled By: sidroyc

fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
2022-02-17 16:19:31 -08:00
Jay Zhuang
f092f0fa5d Add subcompaction event API (#9311)
Summary:
Add event callback for subcompaction and adds a sub_job_id to identify it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311

Reviewed By: ajkr

Differential Revision: D33892707

Pulled By: jay-zhuang

fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e
2022-02-17 15:47:10 -08:00
Andrew Kryczka
babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Jay Zhuang
31031c0210 Remove deprecated RemoteCompaction API (#9570)
Summary:
Remove deprecated remote compaction APIs
`CompactionService::Start()` and `CompactionService::WaitForComplete()`.
Please use `CompactionService::StartV2()`,
`CompactionService::WaitForCompleteV2()` instead, which provides the
same information plus extra data like priority, db_id, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570

Test Plan: CI

Reviewed By: riversand963

Differential Revision: D34255969

Pulled By: jay-zhuang

fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325
2022-02-16 13:25:28 -08:00
mrambacher
c42d0cf862 Add support for decimals to PatternEntry (#9577)
Summary:
Add support for doubles to ObjectLibrary::PatternEntry.  This support will allow patterns containing a non-integer number to be parsed correctly.

Added appropriate test cases to cover this new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9577

Reviewed By: pdillinger

Differential Revision: D34269763

Pulled By: mrambacher

fbshipit-source-id: b5ce16cbd3665c2974ec0f3412ef2b403ef8b155
2022-02-16 11:15:19 -08:00
Peter Dillinger
479eb1aad6 Hide deprecated, inefficient block-based filter from public API (#9535)
Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.

About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.

Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)

This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.

Aside: updated db_bench so that -readonly implies -use_existing_db

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535

Test Plan:
Unit tests updated. Specifically,

* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)

Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`

Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom

Reviewed By: jay-zhuang

Differential Revision: D34153871

Pulled By: pdillinger

fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
2022-02-12 07:05:57 -08:00
Yanqin Jin
d6e1e6f37a Add commit_timestamp and read_timestamp to Pessimistic transaction (#9537)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537

Add `Transaction::SetReadTimestampForValidation()` and
`Transaction::SetCommitTimestamp()` APIs with default implementation
returning `Status::NotSupported()`. Currently, calling these two APIs do not
have any effect.

Also add checks to `PessimisticTransactionDB`
to enforce that column families in the same db either
- disable user-defined timestamp
- enable 64-bit timestamp

Just to clarify, a `PessimisticTransactionDB` can have some column families without
timestamps as well as column families that enable timestamp.

Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_`
used for additional validation and `commit_timestamp_` which denotes when the transaction commits.
For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs)

Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be
 greater than `read_timestamp_` for each transaction and must be set before commit, unless
the transaction does not involve any column family that enables user-defined timestamp.

TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes
that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store
timestamps. When they are passed down, they are still interpreted as
byte-arrays by `DB`.

Reviewed By: ltamasi

Differential Revision: D31567959

fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502
2022-02-11 20:19:15 -08:00
mrambacher
81ada95bd7 Add STATIC_AVOID_DESTRUCTION for ObjectLibrary/Registry (#9464)
Summary:
This change should guarantee that the default ObjectLibrary/Registry are long-lived and not destroyed while the process is running.  This will prevent some issues of them being referenced after they were destroyed via the static destruction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9464

Reviewed By: pdillinger

Differential Revision: D33849876

Pulled By: mrambacher

fbshipit-source-id: 7a69177d7c58c81be293fc7ef8e600d47ddbc14b
2022-02-11 13:20:41 -08:00
Akanksha Mahajan
5c53b9008f Fix failure in c_test (#9547)
Summary:
When tests are run with TMPD, c_test may fail because TMPD
is not created by the test. It results in IO error: No such file
or directory: While mkdir if missing:
/tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547

Test Plan:
make -j32 c_test;
 TEST_TMPDIR=/tmp/rocksdb_test  ./c_test

Reviewed By: riversand963

Differential Revision: D34173298

Pulled By: akankshamahajan15

fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8
2022-02-11 10:31:41 -08:00
mrambacher
fe9d495112 Return different Status based on ObjectRegistry::NewObject calls (#9333)
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.

If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status.  This is the same behavior as previously.

If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned.  If the factory returned a reason why (in the errmsg), this message will be in the returned status.

In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally.  Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.

In practice this likely only changes Customizable that may be partially available.  For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in.  In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available").  Other Customizable builtins/plugins would have the same semantics.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333

Reviewed By: pdillinger

Differential Revision: D33517681

Pulled By: mrambacher

fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
2022-02-11 05:11:24 -08:00
Akanksha Mahajan
9745c68eb1 Remove deprecated option new_table_reader_for_compaction_inputs (#9443)
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33788508

Pulled By: akankshamahajan15

fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
2022-02-08 19:31:28 -08:00
Peter Dillinger
68a9c186d0 FilterPolicy API changes for 7.0 (#9501)
Summary:
* Inefficient block-based filter is no longer customizable in the public
API, though (for now) can still be enabled.
  * Removed deprecated FilterPolicy::CreateFilter() and
  FilterPolicy::KeyMayMatch()
  * Removed `rocksdb_filterpolicy_create()` from C API
* Change meaning of nullptr return from GetBuilderWithContext() from "use
block-based filter" to "generate no filter in this case." This is a
cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
  * Also, when user specifies bits_per_key < 0.5, we now round this down
  to "no filter" because we expect a filter with >= 80% FP rate is
  unlikely to be worth the CPU cost of accessing it (esp with
  cache_index_and_filter_blocks=1 or partition_filters=1).
  * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
  rate)
  * This also gives us some support for configuring filters from OPTIONS
  file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
  Opening from such an options file will enable reading filters (an
  improvement) but not writing new ones. (See Customizable follow-up
  below.)
* Also removed deprecated functions
  * FilterBitsBuilder::CalculateNumEntry()
  * FilterPolicy::GetFilterBitsBuilder()
  * NewExperimentalRibbonFilterPolicy()
* Remove default implementations of
  * FilterBitsBuilder::EstimateEntriesAdded()
  * FilterBitsBuilder::ApproximateNumEntries()
  * FilterPolicy::GetBuilderWithContext()
* Remove support for "filter_policy=experimental_ribbon" configuration
string.
* Allow "filter_policy=bloomfilter:n" without bool to discourage use of
block-based filter.

Some pieces for https://github.com/facebook/rocksdb/issues/9389

Likely follow-up (later PRs):
* Refactoring toward FilterPolicy Customizable, so that we can generate
filters with same configuration as before when configuring from options
file.
* Remove support for user enabling block-based filter (ignore `bool
use_block_based_builder`)
  * Some months after this change, we could even remove read support for
  block-based filter, because it is not critical to DB data
  preservation.
* Make FilterBitsBuilder::FinishV2 to avoid `using
FilterBitsBuilder::Finish` mess and add support for specifying a
MemoryAllocator (for cache warming)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501

Test Plan:
A number of obsolete tests deleted and new tests or test
cases added or updated.

Reviewed By: hx235

Differential Revision: D34008011

Pulled By: pdillinger

fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
2022-02-08 13:56:46 -08:00
Peter Dillinger
5cb137a860 Work around some new clang-analyze failures (#9515)
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481

* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.

Also, use SKIP_LINK=1 to speed up `make analyze`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515

Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests

Reviewed By: hx235

Differential Revision: D34054630

Pulled By: pdillinger

fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
2022-02-07 18:24:36 -08:00
Akanksha Mahajan
bbe4763ee4 Remove Deprecated overloads of DB::GetApproximateSizes (#9458)
Summary:
In RocksDB few overloads of DB::GetApproximateSizes are marked as
DEPRECATED_FUNC, and we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9458

Test Plan: CircleCI

Reviewed By: riversand963

Differential Revision: D34043791

Pulled By: akankshamahajan15

fbshipit-source-id: 815c0ad283a6627c4b241479c7d40ce03a758493
2022-02-07 12:02:57 -08:00
Peter Dillinger
bd08374130 Add GetTemperature on existing files (#9498)
Summary:
For tiered storage

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9498

Test Plan: Just API placeholders for now

Reviewed By: jay-zhuang

Differential Revision: D33993094

Pulled By: pdillinger

fbshipit-source-id: 3cf19a450c7232e05306e94018559b26e9fd35db
2022-02-07 10:40:31 -08:00
Peter Dillinger
fd3e0f43b3 Require C++17 (#9481)
Summary:
Drop support for some old compilers by requiring C++17 standard
(or higher). See https://github.com/facebook/rocksdb/issues/9388

First modification based on this is to remove some conditional compilation in slice.h (also
better for ODR)

Also in this PR:
* Fix some Makefile formatting that seems to affect ASSERT_STATUS_CHECKED config in
some cases
* Add c_test to NON_PARALLEL_TEST in Makefile
* Fix a clang-analyze reported "potential leak" in lru_cache_test
* Better "compatibility" definition of DEFINE_uint32 for old versions of gflags
* Fix a linking problem with shared libraries in Makefile (`./random_test: error while loading shared libraries: librocksdb.so.6.29: cannot open shared object file: No such file or directory`)
* Always set ROCKSDB_SUPPORT_THREAD_LOCAL and use thread_local (from C++11)
  * TODO in later PR: clean up that obsolete flag
* Fix a cosmetic typo in c.h (https://github.com/facebook/rocksdb/issues/9488)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9481

Test Plan:
CircleCI config substantially updated.

* Upgrade to latest Ubuntu images for each release
* Generally prefer Ubuntu 20, but keep a couple Ubuntu 16 builds with oldest supported
compilers, to ensure compatibility
* Remove .circleci/cat_ignore_eagain except for Ubuntu 16 builds, because this is to work
around a kernel bug that should not affect anything but Ubuntu 16.
* Remove designated gcc-9 build, because the default linux build now uses GCC 9 from
Ubuntu 20.
* Add some `apt-key add` to fix some apt "couldn't be verified" errors
* Generally drop SKIP_LINK=1; work-around no longer needed
* Generally `add-apt-repository` before `apt-get update` as manual testing indicated the
reverse might not work.

Travis:
* Use gcc-7 by default (remove specific gcc-7 and gcc-4.8 builds)
* TODO in later PR: fix s390x "Assembler messages: Error: invalid switch -march=z14" failure

AppVeyor:
* Completely dropped because we are dropping VS2015 support and CircleCI covers
VS >= 2017

Also local testing with old gflags (out of necessity when using ROCKSDB_NO_FBCODE=1).

Reviewed By: mrambacher

Differential Revision: D33946377

Pulled By: pdillinger

fbshipit-source-id: ae077c823905b45370a26c0103ada119459da6c1
2022-02-04 17:13:10 -08:00
Jermy Li
83ff350ff2 jni: expose memtable_whole_key_filtering option (#9394)
Summary:
refer to: https://github.com/facebook/rocksdb/wiki/Prefix-Seek#configure-prefix-bloom-filter

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9394

Reviewed By: mrambacher

Differential Revision: D33671533

Pulled By: ajkr

fbshipit-source-id: d90db1712efdd5dd65020329867381d6b3cf2626
2022-02-04 16:01:16 -08:00
Baptiste Lemaire
bec9ab4316 Remove deprecated option DBOptions::max_mem_compaction_level (#9446)
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9446

Reviewed By: ajkr

Differential Revision: D33793048

Pulled By: bjlemaire

fbshipit-source-id: 73316efdb194e90225005246673dae99e65577ae
2022-02-04 05:32:28 -08:00
Yanqin Jin
629e3e1d77 Fix spelling in public API (#9490)
Summary:
I feel it would be nice if we can fix this spelling error.

In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490

Test Plan: make check

Reviewed By: hx235

Differential Revision: D33949862

Pulled By: riversand963

fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
2022-02-03 15:15:23 -08:00