Commit Graph

181 Commits

Author SHA1 Message Date
Levi Tamasi
f07c56928f Set the number of threads up front in db_stress (#9466)
Summary:
With the code on main, `RunStressTest` increments the number of threads
one by one as the threads are created and started. This results in a
data race with `NonBatchedOpsStressTest::VerifyDb`, which reads this
value without synchronization, and is also not correct in the sense
that `VerifyDb` assumes that the number of threads already has its final
value set (e.g. it's checking whether the current thread is the last
one). The patch fixes this by setting the number of threads before
creating/starting any threads. This also eliminates the need for locking
the mutex during thread startup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9466

Test Plan: Ran the blackbox crash test under TSAN for a while.

Reviewed By: ajkr

Differential Revision: D33858856

Pulled By: ltamasi

fbshipit-source-id: 8a6515a83fd1808b8b8dca61978777c4404f04cc
2022-01-29 10:45:41 -08:00
Peter Dillinger
78aee6fedc Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.

In reference to https://github.com/facebook/rocksdb/issues/9389

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33780269

Pulled By: pdillinger

fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
2022-01-27 15:45:30 -08:00
Hui Xiao
1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
Aravind Ramesh
2eac6bb120 db_stress: db_stress fails on custom filesystems. (#9352)
Summary:
db_stress listener service always uses default filesystem to operate,
causing it to not recognize custom filesystem (like ZenFS plugin FS).
Pass the env to db_stress listener with the correct filesystem
information, so it can open the user intended filesystem.

Signed-off-by: Aravind Ramesh <Aravind.Ramesh@wdc.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9352

Reviewed By: riversand963

Differential Revision: D33776762

Pulled By: pdillinger

fbshipit-source-id: e79bb9a544384f80ae9dd0108241ab9c83223954
2022-01-25 16:22:58 -08:00
Yanqin Jin
50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
Andrew Kryczka
6892f19b11 Test correctness with WAL disabled in non-txn blackbox crash tests (#9338)
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on https://github.com/facebook/rocksdb/issues/9305.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.

Reviewed By: riversand963

Differential Revision: D33345333

Pulled By: ajkr

fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
2022-01-05 16:23:37 -08:00
mrambacher
fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
Andrew Kryczka
aa2b3bf675 Added TraceOptions::preserve_write_order (#9334)
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.

By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334

Test Plan:
- See if below unsynced data loss crash test can run  for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.

```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```

Reviewed By: zhichao-cao

Differential Revision: D33301990

Pulled By: ajkr

fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
2021-12-28 15:04:26 -08:00
Andrew Kryczka
2ee20a669d Extend trace filtering to more operation types (#9335)
Summary:
- Extended trace filtering to cover `MultiGet()`, `Seek()`, and `SeekForPrev()`. Now all user ops that can be traced support filtering.
- Enabled the new filter masks in `db_stress` since it only cares to trace writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9335

Test Plan:
- trace-heavy `db_stress` command reduced 30% elapsed time  (79.21 -> 55.47 seconds)

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 --db=/dev/shm/rocksdb_stress_db/ --expected_values_dir=/dev/shm/rocksdb_stress_expected/ --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command reduced 12.4% elapsed time (23.69 -> 20.75 seconds)

Setup command:
```
$  ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -db=/dev/shm/rocksdb_stress_db/ -expected_values_dir=/dev/shm/rocksdb_stress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/rocksdb_stress_expected/ -db=/dev/shm/rocksdb_stress_db/ --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33304580

Pulled By: ajkr

fbshipit-source-id: 0df10f87c1fc506e9484b6b42cea2ef96c7ecd65
2021-12-28 11:46:30 -08:00
Andrew Kryczka
dfff1cecff Filter Get()s from db_stress traces (#9315)
Summary:
`db_stress` traces are used for tracking unsynced changes. For that purpose, we
only need to track writes and not reads. Currently `TraceOptions` only
supports excluding `Get()`s from the trace, so this PR only excludes
`Get()`s. In the future it would be good to exclude `MultiGet()`s and
iterator operations too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9315

Test Plan:
- trace-heavy `db_stress` command elapsed time reduced 37%

Benchmark:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command elapsed time reduced 38%

Setup:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```
Benchmark:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33229900

Pulled By: ajkr

fbshipit-source-id: 0e4251c674d236ddbc4548e9bbfdd608bf3cdc93
2021-12-22 14:17:45 -08:00
Andrew Kryczka
82670fb17b db_stress print hex key for MultiGet() inconsistency (#9324)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9324

Reviewed By: riversand963

Differential Revision: D33248178

Pulled By: ajkr

fbshipit-source-id: c8a7382ed613f9ac3a0a2e3fa7d3c6fe9c95ef85
2021-12-20 23:29:43 -08:00
Andrew Kryczka
b448b71222 db_stress tolerate incomplete tail records in trace file (#9316)
Summary:
I saw the following error when running crash test for a while with
unsynced data loss:

```
Error restoring historical expected values: Corruption: Corrupted trace file.
```

The trace file turned out to have an incomplete tail record. This is
normal considering blackbox kills `db_stress` while trace can be
ongoing.

In the case where the trace file is not otherwise corrupted, there
should be enough records already seen to sync up the expected state with
the recovered DB. This PR ignores any `Status::Corruption` the
`Replayer` returns when that happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9316

Reviewed By: jay-zhuang

Differential Revision: D33230579

Pulled By: ajkr

fbshipit-source-id: 9814af4e39e57f00d85be7404363211762f9b41b
2021-12-20 13:08:49 -08:00
Andrew Kryczka
791723c1ec Fix race condition in db_stress thread setup (#9314)
Summary:
We need to grab `SharedState`'s mutex while calling `IncThreads()` or `IncBgThreads()`. Otherwise the newly launched threads can simultaneously access the thread counters to check if every thread has finished initializing.

Repro command:

```
$ rm -rf /dev/shm/rocksdb/rocksdb_crashtest_{whitebox,expected}/ && mkdir -p /dev/shm/rocksdb/rocksdb_crashtest_{whitebox,expected}/ && ./db_stress --acquire_snapshot_one_in=10000 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=131.8094496796033 --bottommost_compression_type=zlib --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_style=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=1 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=22 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --test_cf_consistency=1 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_block_based_filter=1 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

TSAN error:

```
WARNING: ThreadSanitizer: data race (pid=2750142)
  Read of size 4 at 0x7ffc21d7f58c by thread T39 (mutexes: write M670895590377780496):
    #0 rocksdb::SharedState::AllInitialized() const db_stress_tool/db_stress_shared_state.h:204 (db_stress+0x4fd307)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::ThreadBody(void*) db_stress_tool/db_stress_driver.cc:26 (db_stress+0x4fd307)
    https://github.com/facebook/rocksdb/issues/2 StartThreadWrapper env/env_posix.cc:454 (db_stress+0x84472f)

  Previous write of size 4 at 0x7ffc21d7f58c by main thread:
    #0 rocksdb::SharedState::IncThreads() db_stress_tool/db_stress_shared_state.h:194 (db_stress+0x4fd779)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::RunStressTest(rocksdb::StressTest*) db_stress_tool/db_stress_driver.cc:78 (db_stress+0x4fd779)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::db_stress_tool(int, char**) db_stress_tool/db_stress_tool.cc:348 (db_stress+0x4b97dc)
    https://github.com/facebook/rocksdb/issues/3 main db_stress_tool/db_stress.cc:21 (db_stress+0x47a351)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0x00000001d58c)

  Mutex M670895590377780496 is already destroyed.

  Thread T39 (tid=2750211, running) created by main thread at:
    #0 pthread_create /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/tsan/tsan_interceptors.cc:964 (libtsan.so.0+0x613c3)
    https://github.com/facebook/rocksdb/issues/1 StartThread env/env_posix.cc:464 (db_stress+0x8463c2)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::CompositeEnvWrapper::StartThread(void (*)(void*), void*) env/composite_env_wrapper.h:288 (db_stress+0x4bcd20)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::EnvWrapper::StartThread(void (*)(void*), void*) include/rocksdb/env.h:1475 (db_stress+0x4bb950)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::RunStressTest(rocksdb::StressTest*) db_stress_tool/db_stress_driver.cc:80 (db_stress+0x4fd9d2)
    https://github.com/facebook/rocksdb/issues/5 rocksdb::db_stress_tool(int, char**) db_stress_tool/db_stress_tool.cc:348 (db_stress+0x4b97dc)
    https://github.com/facebook/rocksdb/issues/6 main db_stress_tool/db_stress.cc:21 (db_stress+0x47a351)

 ThreadSanitizer: data race db_stress_tool/db_stress_shared_state.h:204 in rocksdb::SharedState::AllInitialized() const
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9314

Test Plan: verified repro command works after this PR.

Reviewed By: jay-zhuang

Differential Revision: D33217698

Pulled By: ajkr

fbshipit-source-id: 79358fe5adb779fc9dcf80643cc102d4b467fc38
2021-12-20 13:05:23 -08:00
Andrew Kryczka
863c78d2c9 Fix unsynced data loss correctness test with mixed -test_batches_snapshots (#9302)
Summary:
This fixes two bugs in the recently committed DB verification following
crash-recovery with unsynced data loss (https://github.com/facebook/rocksdb/issues/8966):

The first bug was in crash test runs involving mixed values for
`-test_batches_snapshots`. The problem was we were neither restoring
expected values nor enabling tracing when `-test_batches_snapshots=1`.
This caused a future `-test_batches_snapshots=0` run to not find enough
trace data to restore expected values. The fix is to restore expected
values at the start of `-test_batches_snapshots=1` runs, but still leave
tracing disabled as we do not need to track those KVs.

The second bug was in `db_stress` runs that restore the expected values
file and use compaction filter. The compaction filter was initialized to use
the pre-restore expected values, which would be `munmap()`'d during
`FileExpectedStateManager::Restore()`. Then compaction filter would run
into a segfault. The fix is just to reorder compaction filter init after expected
values restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9302

Test Plan:
- To verify the first problem, the below sequence used to fail; now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
```

- The second problem occurred rarely in the form of a SIGSEGV on a file that was `munmap()`d. I have not seen it after this PR though this doesn't prove much.

Reviewed By: jay-zhuang

Differential Revision: D33155283

Pulled By: ajkr

fbshipit-source-id: 66fd0f0edf34015a010c30015f14f104734e964e
2021-12-17 22:05:29 -08:00
Andrew Kryczka
84228e21e8 Fix shutdown in db_stress with -test_batches_snapshots=1 (#9313)
Summary:
The `SharedState` constructor had an early return in case of
`-test_batches_snapshots=1`. This early return caused `num_bg_threads_`
to never be incremented. Consequently, the driver thread could cleanup
objects like the `SharedState` while BG threads were still running and
accessing it, leading to crash.

The fix is to move the logic for counting threads (both FG and BG) to
the place they are launched. That way we can be sure the counts are
consistent, at least for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9313

Test Plan:
below command used to fail, now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
```

Reviewed By: jay-zhuang

Differential Revision: D33198670

Pulled By: ajkr

fbshipit-source-id: 126592dc1eb31998bc8f82ffbf5a0d4eb8dec317
2021-12-17 17:31:40 -08:00
mrambacher
423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Andrew Kryczka
c9818b3325 db_stress verify with lost unsynced operations (#8966)
Summary:
When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of latest unsynced operations are permitted to be dropped, as is the case when data in page cache or certain `Env`s is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8966

Test Plan:
- ran it a while, made sure it is restoring expected values using the historical state/trace files:
```
$ rm -rf ./tmp-db/ ./exp/ && mkdir -p ./tmp-db/ ./exp/ && while ./db_stress -compression_type=none -clear_column_family_one_in=0 -expected_values_dir=./exp -sync_fault_injection=1 -destroy_db_initially=0 -db=./tmp-db -max_key=1000000 -ops_per_thread=10000 -reopen=0 -threads=32 ; do : ; done
```

Reviewed By: pdillinger

Differential Revision: D31219445

Pulled By: ajkr

fbshipit-source-id: f0e1d51fe5b35465b00565c33331190ea38ba0ad
2021-12-15 12:54:44 -08:00
Yanqin Jin
e05c2bb549 Stress test for RocksDB transactions (#8936)
Summary:
Current db_stress does not cover complex read-write transactions. Therefore, this PR adds
coverage for emulated MyRocks-style transactions in `MultiOpsTxnsStressTest`. To achieve this, we need:

- Add a new operation type 'customops' so that we can add new complex groups of operations, e.g. transactions involving multiple read-write operations.
- Implement three read-write transactions and two read-only ones to emulate MyRocks-style transactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8936

Test Plan:
```
make check
./db_stress -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=100000
```

Next step is to add more configurability and refine input generation and result reporting, which will done in separate follow-up PRs.

Reviewed By: zhichao-cao

Differential Revision: D31071795

Pulled By: riversand963

fbshipit-source-id: 50d7c828346ec643311336b904848a1588a37006
2021-12-14 13:34:43 -08:00
Akanksha Mahajan
9e4d56f2c9 Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263)
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263

Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32

Reviewed By: pdillinger

Differential Revision: D32936566

Pulled By: akankshamahajan15

fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
2021-12-08 12:44:38 -08:00
Hui Xiao
66b31c5098 Fix -Werror=maybe-uninitialized in db_stress_tool (#9265)
Summary:
Context/Summary:
Uninitialized variable `SequenceNumber old_saved_seqno` causes asan related compilation error/warning below:

```
db_stress_tool/expected_state.cc:308:55: error: ‘old_saved_seqno’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  308 |   if (s.ok() && old_saved_seqno != kMaxSequenceNumber &&
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fix it by initializing to 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9265

Test Plan:
- make clean && COMPILE_WITH_ASAN=1 make -j48 db_stress_tool/expected_state.o
- monitor if same error happens again after merging

Reviewed By: ajkr

Differential Revision: D32939630

Pulled By: hx235

fbshipit-source-id: 41697515fd11ada8427f606b5dceb4e58d12cb80
2021-12-07 22:42:30 -08:00
Andrew Kryczka
ce42ae6ffd Fix Statistics in db_stress (#9260)
Summary:
The `Statistics` objects are meant to be shared across translation
units, but this was prevented by declaring them static. We need to
ensure they are defined once in the program. The effect is now
`StressTest::PrintStatistics()` can actually print statistics since it
now sees non-null values when `--statistics=1`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9260

Reviewed By: zhichao-cao

Differential Revision: D32910162

Pulled By: ajkr

fbshipit-source-id: c926d6f556177987bee5fa3cbc87597803b230ee
2021-12-07 16:24:22 -08:00
Andrew Kryczka
a6a6aad74e db_stress support tracking historical values (#8960)
Summary:
When `--sync_fault_injection` is set, this PR takes a snapshot of the expected values and starts an operation trace when the DB is opened. These files are stored in `--expected_values_dir`. They will be used for recovering the expected state of the DB following a crash where a suffix of unsynced operations are allowed to be lost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8960

Test Plan: injected crashed at various points in `FileExpectedStateManager` and verified the next run recovers the state/trace file with highest seqno and removes all older/temporary files. Note we don't use sync_fault_injection in CI crash tests yet.

Reviewed By: pdillinger

Differential Revision: D31194941

Pulled By: ajkr

fbshipit-source-id: b0f935a529a0186c5a9c7709fcaa8829de8a84cf
2021-12-07 13:41:48 -08:00
Aravind Ramesh
9c932816cf db_stress: db_stress segmentation fault (#9219)
Summary:
db_stress asserts/seg-faults with below command (on debug and release builds)

```
"rm -rf /tmp/rocksdbtest*; db_stress --ops_per_thread=1000 --reopen=5"
=======================================
Error opening unique id file for append: IO error: No such file or directory:
While open a file for appending: /tmp/rocksdbtest-0/dbstress/.unique_ids:
No such file or directory
Choosing random keys with no overwrite
Creating 2621440 locks
Starting continuous_verification_thread
2021/11/15-08:46:49  Initializing worker threads
2021/11/15-08:46:49  Starting database operations
2021/11/15-08:46:49  Reopening database for the 1th time
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
DB path: [/tmp/rocksdbtest-0/dbstress]
Segmentation fault
=======================================

```
StressTest() constructor deletes the directory "dbstress" because
the option --destroy_db_initially is true by default in db_stress.

This Seg fault happens on a new database, UniqueIdVerifier's constructor
tries to read the ".unique_ids" file, if the file is not present,
ReopenWritableFile() tries to create .unique_ids file, but fails
as the directory db_stress is not available. The data_file_writer_
is set as an invalid(null) pointer and in subsequent calls (~UniqueIdVerifier()
and UniqueIdVerifier::Verify()) it accesses this null pointer and crashes.

This patch creates db_stress directory if it is missing, so the .unique_ids file
is created.

Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9219

Reviewed By: ajkr

Differential Revision: D32730151

Pulled By: pdillinger

fbshipit-source-id: f47baba56b380d93c3ba5608904756e86bbf14f5
2021-11-30 14:56:13 -08:00
Yanqin Jin
42fef0224f Fix build for msvc (#9230)
Summary:
Test plan

With Visual Studio 2017.
```
cd rocksdb
mkdir build && cd build
cmake -G "Visual Studio 15 Win64" -DWITH_GFLAGS=1 ..
MSBuild rocksdb.sln /m /TARGET:cache_bench /TARGET:db_bench /TARGET:db_stress
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9230

Reviewed By: akankshamahajan15

Differential Revision: D32705095

Pulled By: riversand963

fbshipit-source-id: 101e3533f5178b24c0535ddc47a39347ccfcf92c
2021-11-29 14:27:48 -08:00
Levi Tamasi
dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
anand76
78556c14dd Secondary cache error injection (#9002)
Summary:
Implement secondary cache error injection in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9002

Reviewed By: zhichao-cao

Differential Revision: D31874896

Pulled By: anand1976

fbshipit-source-id: 8cf04c061a4a44efa0fe88423d05cade67b85f73
2021-11-08 10:27:27 -08:00
Yanqin Jin
d263505417 Avoid div-by-zero error in db_stress (#9086)
Summary:
If a column family has 0 levels, then existing `TestCompactFiles(...)` may hit
divide-by-zero. To fix, return early if the cf is empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9086

Test Plan: TBD

Reviewed By: ajkr

Differential Revision: D31986799

Pulled By: riversand963

fbshipit-source-id: 48f7dfb2b2b47cfc1315cb71ca80eb230d947f17
2021-10-31 22:16:03 -07:00
Peter Dillinger
0534393fc8 Fix stress/crash test handling of SST unique IDs (#9054)
Summary:
Was not handling the case of OnTableFileCreated invoked for
table file NOT created.

Also improved error reporting and caught a missing status check.

Also strengthened the db_stress listener to require file_size > 0 when
status.ok(). We would be violating the API contract if status is OK and
we didn't create a valid SST file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9054

Test Plan: make blackbox_crash_test for a while

Reviewed By: zhichao-cao

Differential Revision: D31765200

Pulled By: pdillinger

fbshipit-source-id: 7c527f5531bc239a5efd7a7b018545d480f926e2
2021-10-19 11:52:07 -07:00
Peter Dillinger
ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
Levi Tamasi
3e1bf771a3 Make it possible to force the garbage collection of the oldest blob files (#8994)
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.

In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).

These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)

This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994

Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.

Reviewed By: riversand963

Differential Revision: D31489850

Pulled By: ltamasi

fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
2021-10-11 18:03:01 -07:00
Andrew Kryczka
a282eff3d1 Protect existing files in FaultInjectionTest{Env,FS}::ReopenWritableFile() (#8995)
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.

The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995

Test Plan:
- Verified it fixes the following failure:

```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```

- `make check -j48`

Reviewed By: ltamasi

Differential Revision: D31495388

Pulled By: ajkr

fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
2021-10-11 16:23:18 -07:00
Akanksha Mahajan
84d71f30c4 Enable SingleDelete with user defined ts in db_bench and crash tests (#8971)
Summary:
Enable SingleDelete with user defined timestamp in db_bench,
db_stress and crash test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971

Test Plan:
1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py
--enable_ts whitebox --nooverwritepercent=100
ii) make crash_test_with_ts

2. For db_bench, ran:  ./db_bench -benchmarks=randomreplacekeys
-user_timestamp_size=8 -use_single_deletes=true

Reviewed By: riversand963

Differential Revision: D31246558

Pulled By: akankshamahajan15

fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7
2021-10-01 16:48:01 -07:00
mrambacher
13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Andrew Kryczka
559943cdc0 Refactor expected state in stress/crash test (#8913)
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.

- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913

Test Plan:
run in several ways; try to make sure it's not obviously broken.

- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```

Reviewed By: riversand963

Differential Revision: D30921951

Pulled By: ajkr

fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
2021-09-28 14:13:33 -07:00
Andrew Kryczka
791bff5b4e Prevent deadlock in db_stress with DbStressCompactionFilter (#8956)
Summary:
The cyclic dependency was:

- `StressTest::OperateDb()` locks the mutex for key 'k'
- `StressTest::OperateDb()` calls a function like `PauseBackgroundWork()`, which waits for pending compaction to complete.
- The pending compaction reaches key `k` and `DbStressCompactionFilter::FilterV2()` calls `Lock()` on that key's mutex, which hangs forever.

The cycle can be broken by using a new function, `port::Mutex::TryLock()`, which returns immediately upon failure to acquire a lock. In that case `DbStressCompactionFilter::FilterV2()` can just decide to keep the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8956

Reviewed By: riversand963

Differential Revision: D31183718

Pulled By: ajkr

fbshipit-source-id: 329e4a31ce43085af174cf367ef560b5a04399c5
2021-09-24 16:54:02 -07:00
sdong
9320067703 Improve fault injection to MultiRead (#8937)
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937

Reviewed By: anand1976

Differential Revision: D31085930

fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
2021-09-21 14:48:15 -07:00
Andrew Kryczka
5c92aa38ea Avoid overwriting first non-OK Status in db_stress setup (#8907)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8907

Reviewed By: zhichao-cao

Differential Revision: D30922081

Pulled By: ajkr

fbshipit-source-id: ad7a32c21d0049342fd20c9b7f555e93674c3671
2021-09-15 14:28:09 -07:00
Yanqin Jin
2a2b3e03a5 Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.

The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.

The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30654499

Pulled By: riversand963

fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
2021-09-12 15:34:26 -07:00
Peter Dillinger
bda8d93ba9 Fix and detect headers with missing dependencies (#8893)
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.

rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)

Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.

This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893

Test Plan: existing tests and resolving issues detected by new check

Reviewed By: mrambacher

Differential Revision: D30823300

Pulled By: pdillinger

fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
2021-09-10 10:00:26 -07:00
Peter Dillinger
cb5b851ff8 Add (& fix) some simple source code checks (#8821)
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.

These will be run with `make check` and in GitHub actions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821

Test Plan: existing tests, manually try out new checks

Reviewed By: zhichao-cao

Differential Revision: D30791726

Pulled By: pdillinger

fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
2021-09-07 21:19:27 -07:00
Peter Dillinger
c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Peter Dillinger
2a383f21f4 Add Bloom/Ribbon hybrid API support (#8679)
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
2021-08-20 18:00:16 -07:00
Jay Zhuang
0729b287e9 Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)
Summary:
Just like other per_level properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668

Test Plan: stress_test

Reviewed By: zhichao-cao

Differential Revision: D30360967

Pulled By: jay-zhuang

fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f
2021-08-17 09:06:01 -07:00
Baptiste Lemaire
e3a96c4823 Memtable sampling for mempurge heuristic. (#8628)
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
2021-08-10 18:09:03 -07:00
Akanksha Mahajan
052c24a668 Fix db_stress failure (#8632)
Summary:
FaultInjectionTestFS injects error in Rename operation. Because
of injected error, info.log fails to be created if rename  returns error and info_log is set to nullptr which leads to this assertion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8632

Test Plan: run the db_stress job locally

Reviewed By: ajkr

Differential Revision: D30167387

Pulled By: akankshamahajan15

fbshipit-source-id: 8d08c4c33e8f0cabd368bbb498d21b9de0660067
2021-08-07 09:21:03 -07:00
Baptiste Lemaire
d6006f9c9b Add experimental mempurge policy flag to db_stress. (#8588)
Summary:
Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`.
This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default).
- `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds.
- `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588

Reviewed By: pdillinger

Differential Revision: D29934251

Pulled By: bjlemaire

fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae
2021-07-28 13:27:58 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
sdong
9b41082d4a Complete the fix of stress open WAL drop fix (#8570)
Summary:
https://github.com/facebook/rocksdb/pull/8548 is not complete. We should instead cover all cases writable files are buffered, not just when failures are ingested. Extend it to any case where failures are ingested in DB open.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8570

Test Plan: Run db_stress and see it doesn't break

Reviewed By: jay-zhuang

Differential Revision: D29830415

fbshipit-source-id: 94449a0468fb2f7eec17423724008c9c63b2445d
2021-07-21 16:08:53 -07:00
Jay Zhuang
66ca5ac427 Cleanup cf handlers before deleting db (#8564)
Summary:
Delete column family handlers before deleting db to avoid `last_ref`
assert.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8564

Test Plan: Inject compaction test in db_stress test

Reviewed By: pdillinger

Differential Revision: D29797375

Pulled By: jay-zhuang

fbshipit-source-id: e8baf4d279f4db5d963db95c9445454156205501
2021-07-20 14:59:40 -07:00
Peter Dillinger
d5f3b77f23 Add GetMapProperty to db_stress (#8551)
Summary:
Already has good coverage for GetProperty and GetIntProperty
but this one was missing.

This should add more confidence to https://github.com/facebook/rocksdb/issues/8538

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8551

Test Plan:
brief local run with boosted probability showed no immediate
issues

Reviewed By: siying

Differential Revision: D29746383

Pulled By: pdillinger

fbshipit-source-id: 9f9f525bc1a7607f85e563e33bda1979ef197127
2021-07-19 08:10:29 -07:00