Commit Graph

4124 Commits

Author SHA1 Message Date
Levi Tamasi
b9bb59d49d Add initial set of options for integrated blob write path (#7280)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7280

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23195192

Pulled By: ltamasi

fbshipit-source-id: 743b382de391963e62ba86119e9fbd0233ea3b3a
2020-08-18 18:32:37 -07:00
Akanksha Mahajan
cc24ac14eb Store FSSequentialFilePtr object in SequenceFileReader (#7190)
Summary:
This diff contains following changes:
    1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.

Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190

Test Plan:
make check -j64
          COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D23059616

Pulled By: akankshamahajan15

fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
2020-08-18 16:20:54 -07:00
sdong
b194c21bba Whole DBTest to skip fsync (#7274)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.

This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D23177444

fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
2020-08-17 18:42:25 -07:00
Andrew Kryczka
5d5ff82408 Disable recycle_log_file_num with kTolerateCorruptedTailRecords (#7271)
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271

Reviewed By: riversand963

Differential Revision: D23169923

Pulled By: ajkr

fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
2020-08-17 18:21:10 -07:00
Yanqin Jin
92593d511a Add a new EntryType for deletion with timestamp (#7195)
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22914704

Pulled By: riversand963

fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
2020-08-17 16:26:06 -07:00
Levi Tamasi
9b083cb11c Build blob file reader/writer classes in LITE mode as well (#7272)
Summary:
The patch makes sure that the functionality required for the new integrated
BlobDB implementation (most importantly, the classes related to reading and
writing blob files) is also built in LITE mode by removing the corresponding
`#ifndef`s.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7272

Test Plan: Ran `make check` in both regular and LITE mode.

Reviewed By: zhichao-cao

Differential Revision: D23173280

Pulled By: ltamasi

fbshipit-source-id: 1596bd1a76409a8a6d83d8f1dbfe08bfdea7ffe6
2020-08-17 15:19:05 -07:00
sdong
1760637539 CompactRange() refit level should confirm destination level is not empty (#7261)
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261

Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.

Reviewed By: ajkr

Differential Revision: D23142269

fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
2020-08-17 14:21:53 -07:00
matthewvon
2ad88ceae9 Populate cf_id member of CompactionJobInfo for OnCompactionBegin (#6938)
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938

Reviewed By: siying

Differential Revision: D23153235

Pulled By: ajkr

fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
2020-08-17 11:57:47 -07:00
Jay Zhuang
69760b4d05 Introduce a global StatsDumpScheduler for stats dumping (#7223)
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223

Reviewed By: riversand963

Differential Revision: D23056737

Pulled By: jay-zhuang

fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
2020-08-14 20:12:44 -07:00
Yanqin Jin
d758273ceb Get() with timestamp should respect snapshot (#7227)
Summary:
If user-defined timestamp is enabled, current implementation can expose
newer data to queries even if an older sequence number is specified via
read_options.snapshot. This PR makes Get() respect sequence-number-based
snapshot.

Solution is simple. Besides using <ukey, ts, seq> to search the index for the key,
we also verify that the candidate result's seq is smaller than or equal to seq. This
requires passing a seq via `GetContext`, which results in the majority of code
change caused by this PR.

Also added a few unit tests to demonstrate standard visibility during point lookup
and range scan when timestamp and snapshot are both present.

Test plan (devserver):
```
make check
$./db_bench --benchmarks=fillseq,readrandom -cache_size=$[64*1024*1024]
```
Result
this PR: readrandom   :       4.827 micros/op 207180 ops/sec;   22.9 MB/s (1000000 of 1000000 found)
master:  readrandom   :       4.936 micros/op 202610 ops/sec;   22.4 MB/s (1000000 of 1000000 found)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227

Reviewed By: ltamasi

Differential Revision: D23015242

Pulled By: riversand963

fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a
2020-08-14 19:20:58 -07:00
Andrew Kryczka
a1aa3f8385 Disable manual compaction during ReFitLevel() (#7250)
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.

This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.

Fixes https://github.com/facebook/rocksdb/issues/6432.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250

Test Plan:
crash test command that repro'd the bug reliably:

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```

Reviewed By: ltamasi

Differential Revision: D23090800

Pulled By: ajkr

fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
2020-08-14 11:29:52 -07:00
sdong
e7358da9a2 Upgrade tool chain (#7251)
Summary:
Upgrade tool chain to the latest. It is done mostly manually as build_tools/build_detect_platform fails to update many of them.

Try to fix a new clang analyze warning with the new tool chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7251

Test Plan: "make all", "USE_CLANG=1 make all"

Reviewed By: riversand963

Differential Revision: D23091090

fbshipit-source-id: 732e5a30137837431438f85f36296406b641f975
2020-08-12 19:30:00 -07:00
Levi Tamasi
9d6f48ec1d Clean up CompressBlock/CompressBlockInternal a bit (#7249)
Summary:
The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit.
In particular, it does the following:
* It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`,
where other general compression-related utilities are located. This will facilitate reuse in the
BlobDB write path.
* The signature of the method is changed so it now takes `compression_format_version`
(similarly to the compression library specific methods) instead of `format_version` (which is
specific to the block based table).
* `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter.
This parameter was only used in a (not entirely up-to-date) assertion; also, removing it
eliminates the need to ensure this precondition holds at all call sites.
* Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass
only one of `sampled_output_fast` and `sampled_output_slow`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23087278

Pulled By: ltamasi

fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3
2020-08-12 18:25:48 -07:00
Akanksha Mahajan
1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
Zitan Chen
b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
Peter Dillinger
6ac1d25fd0 Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.

This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.

More specifics:

Removed some unused test classes, and updated comments on the general
problem.

Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.

Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env

Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc

stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)

Intended follow-up:

Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)

With that change, we can simplify/consolidate the scattered work-arounds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101

Test Plan: make check on Linux and MacOS

Reviewed By: zhichao-cao

Differential Revision: D23032815

Pulled By: pdillinger

fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 12:41:30 -07:00
Levi Tamasi
a99fb67233 Remove redundant consistency check from VersionStorageInfo::AddFile (#7237)
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23041937

Pulled By: ltamasi

fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
2020-08-11 09:23:17 -07:00
anand76
f308da5273 Fix delete triggered compaction for single level universal (#7224)
Summary:
Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files).

Tests:
Add unit tests to ```compaction_picker_test.cc```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224

Reviewed By: ajkr

Differential Revision: D23031845

Pulled By: anand1976

fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505
2020-08-10 12:19:17 -07:00
Yuhong Guo
5444942f15 Fix cmake build on MacOS (#7205)
Summary:
1. `std::random_shuffle` is deprecated and now we can use `std::shuffle`
```
/rocksdb/db/prefix_test.cc:590:12: error: 'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >'
      is deprecated [-Werror,-Wdeprecated-declarations]
      std::random_shuffle(prefixes.begin(), prefixes.end());
           ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2982:1: note:
      'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >' has been explicitly marked deprecated here
_LIBCPP_DEPRECATED_IN_CXX14 void
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1107:39: note: expanded from macro
      '_LIBCPP_DEPRECATED_IN_CXX14'
#  define _LIBCPP_DEPRECATED_IN_CXX14 _LIBCPP_DEPRECATED
                                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1090:48: note: expanded from macro
      '_LIBCPP_DEPRECATED'
#    define _LIBCPP_DEPRECATED __attribute__ ((deprecated))
```
2. `c_test` link error with `-DROCKSDB_BUILD_SHARED=OFF`:
```
[  7%] Linking CXX executable c_test
ld: library not found for -lrocksdb-shared
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[5]: *** [c_test] Error 1
make[4]: *** [CMakeFiles/c_test.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7205

Reviewed By: ajkr

Differential Revision: D23030641

Pulled By: pdillinger

fbshipit-source-id: f270e50fc0b824ca1a0876ec5c65d33f55a72dd0
2020-08-10 10:48:05 -07:00
Remington Brasga
633bff2f19 Fixed typo on Value mismatch error in db_test (#6587)
Summary:
The debug is supposed to print out two keys to show the value mismatch, which was compared just a few lines above.

However, the actual print-out is the same values (so they obviously won't be mismatched)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6587

Reviewed By: riversand963

Differential Revision: D23025279

Pulled By: ajkr

fbshipit-source-id: 4c6c35bc60b273f13c08b5464b6f690d8a5cfe41
2020-08-10 10:06:08 -07:00
anand76
832b056a30 Enable IO timeouts for iterators (#7161)
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161

Test Plan: New unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22687352

Pulled By: anand1976

fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
2020-08-07 12:01:08 -07:00
Zhichao Cao
b79f13b2aa Fix the potential deadlock in WriteImplWALOnly and UnorderedWriteMemtable (#7199)
Summary:
Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly.
Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199

Test Plan: pass make check

Reviewed By: anand1976

Differential Revision: D22961714

Pulled By: zhichao-cao

fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8
2020-08-07 11:28:49 -07:00
mrambacher
56f468b356 Add more tests to ASSERT_STATUS_CHECKED (#7211)
Summary:
Added 4 more tests to those which pass ASSERT_STATUS_CHECKED (cache_test, lru_cache_test, filename_test, filelock_test).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7211

Reviewed By: ajkr

Differential Revision: D22982858

Pulled By: zhichao-cao

fbshipit-source-id: acdd071582ed6aa7447ed96c5732f10bf720d783
2020-08-06 17:19:41 -07:00
Yingchun Lai
67bbac3621 Remove duplicate colon in Status message (#7041)
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041

Reviewed By: ajkr

Differential Revision: D22292801

fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
2020-08-06 15:18:04 -07:00
jsteemann
5e1808d515 fix typo: paraniod -> paranoid (#7163)
Summary:
Rename "paraniod" to "paranoid" in a few places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7163

Reviewed By: ajkr

Differential Revision: D22678242

fbshipit-source-id: 28b1011a736d0a95612676f7e1b9500a70c324b4
2020-08-06 14:25:34 -07:00
Cheng Chang
cd48ecaa1a Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).

`WalSet` is the set of alive WALs kept in `VersionSet`.

1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs

On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.

But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.

We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.

In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.

2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`

`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.

But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet`  for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164

Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test

Reviewed By: ltamasi

Differential Revision: D22677936

Pulled By: cheng-chang

fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
2020-08-05 16:34:38 -07:00
sdong
5c1a544122 Clean up InternalIterator upper bound logic a little bit (#7200)
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200

Test Plan: Run all existing test. Would run crash test with atomic for a while.

Reviewed By: anand1976

Differential Revision: D22833181

fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
2020-08-05 10:44:57 -07:00
Yanqin Jin
2735b0275d ReadOptions.iter_start_ts should support tombstones (#7178)
Summary:
as title.
When ReadOptions.iter_start_ts is not nullptr, DBIter::key() should
return internal keys including value type.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7178

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D22935879

Pulled By: riversand963

fbshipit-source-id: 7508d962cf11ebcfa6386d2529b4f3606b47ccfd
2020-08-04 18:52:08 -07:00
Akanksha Mahajan
493f425e77 Add support to start and end IOTracing through DB APIs (#7203)
Summary:
1. Add support to start io tracing through DB::StartIOTrace(Env*, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open.

User side code:

//Open DB
DB::Open(options, dbname, &db);

/* Start tracing */
db->StartIOTrace(env, trace_opt, std::move(trace_writer));

/* Perform Operations */

/*End tracing*/
db->EndIOTrace();

2. Fix the build errors for Windows.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D22901947

Pulled By: akankshamahajan15

fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c
2020-08-04 18:41:45 -07:00
Andrew Kryczka
a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
Aaron Kabcenell
56ed601df3 Compaction Read/Write Stats by Compaction Type (#7165)
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165

Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:

/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1

Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.

Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking

Reviewed By: ajkr

Differential Revision: D22693050

Pulled By: akabcenell

fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
2020-07-29 13:39:29 -07:00
codingsh
50f206ad84 feat: export SetBackgroundThreads(n, Env::BOTTOM); (#7191)
Summary:
- https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7191

Reviewed By: riversand963

Differential Revision: D22809066

Pulled By: ajkr

fbshipit-source-id: 036939f9a28cacc3f677c318d1aed97fe5f4f85e
2020-07-29 12:24:13 -07:00
sdong
692f6a3138 Implement NextAndGetResult() in memtable and level iterator (#7179)
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179

Test Plan: Run normal crash test for a while.

Reviewed By: anand1976

Differential Revision: D22783840

fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
2020-07-29 09:45:21 -07:00
mrambacher
d9d190742c Make env*_test work with ASSERT_STATUS_CHECKED (#7176)
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.

One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176

Reviewed By: cheng-chang

Differential Revision: D22799278

Pulled By: ajkr

fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
2020-07-28 22:59:48 -07:00
codingsh
83ea266b43 export stats_persist_period_sec (#7168)
Summary:
fixed
 - https://github.com/rust-rocksdb/rust-rocksdb/issues/447
 -  https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7168

Reviewed By: cheng-chang

Differential Revision: D22736013

Pulled By: ajkr

fbshipit-source-id: fdd784aa75d26a367b9108b05ffdd94a2ae117d3
2020-07-28 13:05:34 -07:00
Tomas Kolda
cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Jay Zhuang
b0c5ecd6b3 Make max_subcompactions dynamically changeable (#7159)
Summary:
Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159

Reviewed By: siying

Differential Revision: D22671238

Pulled By: jay-zhuang

fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42
2020-07-22 18:32:52 -07:00
Levi Tamasi
0d04a8434a Sync blob files before closing them (#7160)
Summary:
BlobDB currently syncs each blob file periodically after writing a certain amount of
data (as specified by the configuration option `BlobDBOptions::bytes_per_sync`)
and all open blob files when the base DB's memtables are flushed. With the patch,
in addition to the above, blob files are also synced right before being closed, after
the footer has been written. This will be beneficial for the new integrated blob file
write path as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7160

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22672646

Pulled By: ltamasi

fbshipit-source-id: 62b34263543a7e74abcbb7adf011daa1e699998f
2020-07-22 17:25:20 -07:00
Cheng Chang
96ce0470a7 Clean snapshot dir before taking snapshot (#7156)
Summary:
`DBTest::SnapshotFiles` runs the tests in a `while` loop.
Currently, the snapshot directory is not cleaned up in each loop, so previous snapshot files may remain in the next loop's snapshot.
When I'm working on https://github.com/facebook/rocksdb/pull/7129, when checking the tracked WALs in MANIFEST, I find that this test always fails because it reads some unknown WAL. It turns out that the unknown WAL is left from previous loops.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7156

Test Plan: make db_test && ./db_test --gtest_filters=*SnapshotFiles

Reviewed By: siying

Differential Revision: D22668360

Pulled By: cheng-chang

fbshipit-source-id: 69d4aa3506038ba30e218e8ae966357935a99c6c
2020-07-22 13:54:01 -07:00
mrambacher
d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Haosen Wen
dbc51adbac Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153)
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153

Test Plan: make check.

Reviewed By: riversand963

Differential Revision: D22654136

Pulled By: roghnin

fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
2020-07-22 08:55:02 -07:00
sdong
1cf4731dbb column_family_test: fix a data race related to sleeping task (#7150)
Summary:
TSAN reports warning in one column_family_test:

WARNING: ThreadSanitizer: data race (pid=16352)
  Write of size 8 at 0x7ffcdf042158 by main thread:
    #0 pthread_cond_destroy <null> (column_family_test+0x471f65)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::~CondVar() /home/circleci/project/port/port_posix.cc:101:49 (column_family_test+0x8a627a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask() /home/circleci/project/./test_util/testutil.h:397:7 (column_family_test+0x54b6e2)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::ColumnFamilyTest_FlushCloseWALFiles_Test::TestBody() /home/circleci/project/db/column_family_test.cc:3008:1 (column_family_test+0x54b6e2)
......
  Previous read of size 8 at 0x7ffcdf042158 by thread T2 (mutexes: write M0):
    #0 pthread_cond_broadcast <null> (column_family_test+0x471dd2)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::SignalAll() /home/circleci/project/port/port_posix.cc:139:28 (column_family_test+0x8a651a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::DoSleep() /home/circleci/project/./test_util/testutil.h:412:12 (column_family_test+0x58574b)
......

Likely, SleepingBackgroundTask::DoSleep() started to execute after the main thread has finished everything, cancelled and waited for sleeping tasks to finish. At this time, although DoSlee() will not sleep, but it also accesses the mutex, creating a data race with destructor of the test. Fix this bug by waiting for the sleeping task to start sleeping after it is scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7150

Test Plan: Run these modified tests and make sure it doesn't break.

Reviewed By: riversand963

Differential Revision: D22630716

fbshipit-source-id: cc5781cf69083685de406490438898238bdfc2d3
2020-07-20 14:19:48 -07:00
sdong
9870704420 Fix a minor data race in stats dumping threads initialization (#7151)
Summary:
https://github.com/facebook/rocksdb/pull/7145 creates a minor data race against the stat creation counter. Turn it to atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7151

Test Plan: Run the test.

Reviewed By: ajkr

Differential Revision: D22631014

fbshipit-source-id: c6fb69ac5b9df7139795dacea5ce9fb9fd3278d7
2020-07-20 12:12:43 -07:00
Zhichao Cao
ed4712fe7e Remove time out testing cases in error_handler_fs_test (#7141)
Summary:
Remove the 3 testing cases that cause the time out in linux build by https://github.com/facebook/rocksdb/issues/6765 . Will fix them later.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7141

Test Plan: make asan_check, buck run

Reviewed By: ajkr

Differential Revision: D22593831

Pulled By: zhichao-cao

fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e
2020-07-17 23:27:21 -07:00
Andrew Kryczka
9a83fd21e6 stagger first DumpMallocStats after opening DB (#7145)
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145

Reviewed By: riversand963

Differential Revision: D22593031

Pulled By: ajkr

fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
2020-07-17 16:13:26 -07:00
sdong
ca5a069a79 Suppress a TSAN warning (#7126)
Summary:
TSAN shows warning with clang with warning similar to this:

WARNING: ThreadSanitizer: data race (pid=10159)
  Atomic write of size 8 at 0x7b5000002890 by thread T33:
    #0 __tsan_atomic64_store <null> (db_test+0x4ca2b5)
    https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde)
......
  Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832):
    #0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void*) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd)

It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot().

While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI.

In the gcc runs we have been running, this warning rarely shows up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126

Test Plan: See the mini-TSAN test to pass with reasonable run time.

Reviewed By: ajkr

Differential Revision: D22552375

fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00
2020-07-15 13:25:14 -07:00
Zhichao Cao
a10f12eda1 Auto resume the DB from Retryable IO Error (#6765)
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765

Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Reviewed By: anand1976

Differential Revision: D21916789

Pulled By: zhichao-cao

fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
2020-07-15 11:03:58 -07:00
Yanqin Jin
27735dea9a Report corrupted keys during compaction (#7124)
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.

Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22530722

Pulled By: riversand963

fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
2020-07-14 17:18:17 -07:00
Levi Tamasi
bdf4de6cb9 Remove some dead code from BlobLogWriter (#7125)
Summary:
Periodic syncing of blob files is performed by `WritableFileWriter`;
`bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are
actually unused (or more precisely, only used by methods that are
themselves unused). The patch removes all this dead code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22531021

Pulled By: ltamasi

fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10
2020-07-14 13:51:54 -07:00
Yanqin Jin
c628fae6d1 Report corruption on unrecognized value type (#7121)
Summary:
During memtable lookup, an unrecognized value type should be reported as
Status::Corruption.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7121

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D22512124

Pulled By: riversand963

fbshipit-source-id: 9b97be7d9b230c5aae9205f96054420e5ea09066
2020-07-13 20:26:58 -07:00