Commit Graph

4279 Commits

Author SHA1 Message Date
mrambacher
e9befdebbf Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests.  The DB functionality required for this test now passes the check.

When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283

Reviewed By: akankshamahajan15

Differential Revision: D23251497

Pulled By: ajkr

fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
2020-08-20 19:18:35 -07:00
Stanislav Tkach
b288f0131b Add getters for the read options to the C API (#7289)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7289

Reviewed By: akankshamahajan15

Differential Revision: D23252520

Pulled By: ajkr

fbshipit-source-id: 85cea485a6dcaa1c67c32a83eb49a1b623966609
2020-08-20 16:36:19 -07:00
Cheng Chang
ce4192375d Track WAL in MANIFEST: minor udpates (#7282)
Summary:
The updates resolve comments left from https://github.com/facebook/rocksdb/pull/7164.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7282

Test Plan: wal_edit_test

Reviewed By: ltamasi

Differential Revision: D23196824

Pulled By: cheng-chang

fbshipit-source-id: 797f3fef27fc72114c2be777d9eadd3429da5301
2020-08-20 15:12:00 -07:00
Jay Zhuang
3e422ce0ca Fix a timer_test deadlock (#7277)
Summary:
There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277

Reviewed By: pdillinger

Differential Revision: D23183873

Pulled By: jay-zhuang

fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
2020-08-20 08:43:13 -07:00
Jay Zhuang
ac7dcfda10 Add missing ComputeCompactionScore() for a new universal manual compaction (#7281)
Summary:
Seems it's only causing assert failure during compaction pick, but in production code, the problematic compactions are excluded at a later step.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7281

Reviewed By: akankshamahajan15

Differential Revision: D23228000

Pulled By: jay-zhuang

fbshipit-source-id: 2e4055aeebe0f5a2b07e299e0a2d51a1ad2e216d
2020-08-19 17:42:08 -07:00
Levi Tamasi
b9bb59d49d Add initial set of options for integrated blob write path (#7280)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7280

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23195192

Pulled By: ltamasi

fbshipit-source-id: 743b382de391963e62ba86119e9fbd0233ea3b3a
2020-08-18 18:32:37 -07:00
Akanksha Mahajan
cc24ac14eb Store FSSequentialFilePtr object in SequenceFileReader (#7190)
Summary:
This diff contains following changes:
    1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.

Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190

Test Plan:
make check -j64
          COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D23059616

Pulled By: akankshamahajan15

fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
2020-08-18 16:20:54 -07:00
sdong
b194c21bba Whole DBTest to skip fsync (#7274)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.

This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D23177444

fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
2020-08-17 18:42:25 -07:00
Andrew Kryczka
5d5ff82408 Disable recycle_log_file_num with kTolerateCorruptedTailRecords (#7271)
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271

Reviewed By: riversand963

Differential Revision: D23169923

Pulled By: ajkr

fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
2020-08-17 18:21:10 -07:00
Yanqin Jin
92593d511a Add a new EntryType for deletion with timestamp (#7195)
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22914704

Pulled By: riversand963

fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
2020-08-17 16:26:06 -07:00
Levi Tamasi
9b083cb11c Build blob file reader/writer classes in LITE mode as well (#7272)
Summary:
The patch makes sure that the functionality required for the new integrated
BlobDB implementation (most importantly, the classes related to reading and
writing blob files) is also built in LITE mode by removing the corresponding
`#ifndef`s.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7272

Test Plan: Ran `make check` in both regular and LITE mode.

Reviewed By: zhichao-cao

Differential Revision: D23173280

Pulled By: ltamasi

fbshipit-source-id: 1596bd1a76409a8a6d83d8f1dbfe08bfdea7ffe6
2020-08-17 15:19:05 -07:00
sdong
1760637539 CompactRange() refit level should confirm destination level is not empty (#7261)
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261

Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.

Reviewed By: ajkr

Differential Revision: D23142269

fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
2020-08-17 14:21:53 -07:00
matthewvon
2ad88ceae9 Populate cf_id member of CompactionJobInfo for OnCompactionBegin (#6938)
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938

Reviewed By: siying

Differential Revision: D23153235

Pulled By: ajkr

fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
2020-08-17 11:57:47 -07:00
Jay Zhuang
69760b4d05 Introduce a global StatsDumpScheduler for stats dumping (#7223)
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223

Reviewed By: riversand963

Differential Revision: D23056737

Pulled By: jay-zhuang

fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
2020-08-14 20:12:44 -07:00
Yanqin Jin
d758273ceb Get() with timestamp should respect snapshot (#7227)
Summary:
If user-defined timestamp is enabled, current implementation can expose
newer data to queries even if an older sequence number is specified via
read_options.snapshot. This PR makes Get() respect sequence-number-based
snapshot.

Solution is simple. Besides using <ukey, ts, seq> to search the index for the key,
we also verify that the candidate result's seq is smaller than or equal to seq. This
requires passing a seq via `GetContext`, which results in the majority of code
change caused by this PR.

Also added a few unit tests to demonstrate standard visibility during point lookup
and range scan when timestamp and snapshot are both present.

Test plan (devserver):
```
make check
$./db_bench --benchmarks=fillseq,readrandom -cache_size=$[64*1024*1024]
```
Result
this PR: readrandom   :       4.827 micros/op 207180 ops/sec;   22.9 MB/s (1000000 of 1000000 found)
master:  readrandom   :       4.936 micros/op 202610 ops/sec;   22.4 MB/s (1000000 of 1000000 found)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227

Reviewed By: ltamasi

Differential Revision: D23015242

Pulled By: riversand963

fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a
2020-08-14 19:20:58 -07:00
Andrew Kryczka
a1aa3f8385 Disable manual compaction during ReFitLevel() (#7250)
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.

This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.

Fixes https://github.com/facebook/rocksdb/issues/6432.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250

Test Plan:
crash test command that repro'd the bug reliably:

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```

Reviewed By: ltamasi

Differential Revision: D23090800

Pulled By: ajkr

fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
2020-08-14 11:29:52 -07:00
sdong
e7358da9a2 Upgrade tool chain (#7251)
Summary:
Upgrade tool chain to the latest. It is done mostly manually as build_tools/build_detect_platform fails to update many of them.

Try to fix a new clang analyze warning with the new tool chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7251

Test Plan: "make all", "USE_CLANG=1 make all"

Reviewed By: riversand963

Differential Revision: D23091090

fbshipit-source-id: 732e5a30137837431438f85f36296406b641f975
2020-08-12 19:30:00 -07:00
Levi Tamasi
9d6f48ec1d Clean up CompressBlock/CompressBlockInternal a bit (#7249)
Summary:
The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit.
In particular, it does the following:
* It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`,
where other general compression-related utilities are located. This will facilitate reuse in the
BlobDB write path.
* The signature of the method is changed so it now takes `compression_format_version`
(similarly to the compression library specific methods) instead of `format_version` (which is
specific to the block based table).
* `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter.
This parameter was only used in a (not entirely up-to-date) assertion; also, removing it
eliminates the need to ensure this precondition holds at all call sites.
* Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass
only one of `sampled_output_fast` and `sampled_output_slow`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23087278

Pulled By: ltamasi

fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3
2020-08-12 18:25:48 -07:00
Akanksha Mahajan
1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
Zitan Chen
b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
Peter Dillinger
6ac1d25fd0 Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.

This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.

More specifics:

Removed some unused test classes, and updated comments on the general
problem.

Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.

Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env

Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc

stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)

Intended follow-up:

Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)

With that change, we can simplify/consolidate the scattered work-arounds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101

Test Plan: make check on Linux and MacOS

Reviewed By: zhichao-cao

Differential Revision: D23032815

Pulled By: pdillinger

fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 12:41:30 -07:00
Levi Tamasi
a99fb67233 Remove redundant consistency check from VersionStorageInfo::AddFile (#7237)
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23041937

Pulled By: ltamasi

fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
2020-08-11 09:23:17 -07:00
anand76
f308da5273 Fix delete triggered compaction for single level universal (#7224)
Summary:
Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files).

Tests:
Add unit tests to ```compaction_picker_test.cc```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224

Reviewed By: ajkr

Differential Revision: D23031845

Pulled By: anand1976

fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505
2020-08-10 12:19:17 -07:00
Yuhong Guo
5444942f15 Fix cmake build on MacOS (#7205)
Summary:
1. `std::random_shuffle` is deprecated and now we can use `std::shuffle`
```
/rocksdb/db/prefix_test.cc:590:12: error: 'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >'
      is deprecated [-Werror,-Wdeprecated-declarations]
      std::random_shuffle(prefixes.begin(), prefixes.end());
           ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2982:1: note:
      'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >' has been explicitly marked deprecated here
_LIBCPP_DEPRECATED_IN_CXX14 void
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1107:39: note: expanded from macro
      '_LIBCPP_DEPRECATED_IN_CXX14'
#  define _LIBCPP_DEPRECATED_IN_CXX14 _LIBCPP_DEPRECATED
                                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1090:48: note: expanded from macro
      '_LIBCPP_DEPRECATED'
#    define _LIBCPP_DEPRECATED __attribute__ ((deprecated))
```
2. `c_test` link error with `-DROCKSDB_BUILD_SHARED=OFF`:
```
[  7%] Linking CXX executable c_test
ld: library not found for -lrocksdb-shared
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[5]: *** [c_test] Error 1
make[4]: *** [CMakeFiles/c_test.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7205

Reviewed By: ajkr

Differential Revision: D23030641

Pulled By: pdillinger

fbshipit-source-id: f270e50fc0b824ca1a0876ec5c65d33f55a72dd0
2020-08-10 10:48:05 -07:00
Remington Brasga
633bff2f19 Fixed typo on Value mismatch error in db_test (#6587)
Summary:
The debug is supposed to print out two keys to show the value mismatch, which was compared just a few lines above.

However, the actual print-out is the same values (so they obviously won't be mismatched)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6587

Reviewed By: riversand963

Differential Revision: D23025279

Pulled By: ajkr

fbshipit-source-id: 4c6c35bc60b273f13c08b5464b6f690d8a5cfe41
2020-08-10 10:06:08 -07:00
anand76
832b056a30 Enable IO timeouts for iterators (#7161)
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161

Test Plan: New unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22687352

Pulled By: anand1976

fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
2020-08-07 12:01:08 -07:00
Zhichao Cao
b79f13b2aa Fix the potential deadlock in WriteImplWALOnly and UnorderedWriteMemtable (#7199)
Summary:
Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly.
Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199

Test Plan: pass make check

Reviewed By: anand1976

Differential Revision: D22961714

Pulled By: zhichao-cao

fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8
2020-08-07 11:28:49 -07:00
mrambacher
56f468b356 Add more tests to ASSERT_STATUS_CHECKED (#7211)
Summary:
Added 4 more tests to those which pass ASSERT_STATUS_CHECKED (cache_test, lru_cache_test, filename_test, filelock_test).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7211

Reviewed By: ajkr

Differential Revision: D22982858

Pulled By: zhichao-cao

fbshipit-source-id: acdd071582ed6aa7447ed96c5732f10bf720d783
2020-08-06 17:19:41 -07:00
Yingchun Lai
67bbac3621 Remove duplicate colon in Status message (#7041)
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041

Reviewed By: ajkr

Differential Revision: D22292801

fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
2020-08-06 15:18:04 -07:00
jsteemann
5e1808d515 fix typo: paraniod -> paranoid (#7163)
Summary:
Rename "paraniod" to "paranoid" in a few places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7163

Reviewed By: ajkr

Differential Revision: D22678242

fbshipit-source-id: 28b1011a736d0a95612676f7e1b9500a70c324b4
2020-08-06 14:25:34 -07:00
Cheng Chang
cd48ecaa1a Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).

`WalSet` is the set of alive WALs kept in `VersionSet`.

1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs

On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.

But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.

We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.

In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.

2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`

`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.

But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet`  for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164

Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test

Reviewed By: ltamasi

Differential Revision: D22677936

Pulled By: cheng-chang

fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
2020-08-05 16:34:38 -07:00
sdong
5c1a544122 Clean up InternalIterator upper bound logic a little bit (#7200)
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200

Test Plan: Run all existing test. Would run crash test with atomic for a while.

Reviewed By: anand1976

Differential Revision: D22833181

fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
2020-08-05 10:44:57 -07:00
Yanqin Jin
2735b0275d ReadOptions.iter_start_ts should support tombstones (#7178)
Summary:
as title.
When ReadOptions.iter_start_ts is not nullptr, DBIter::key() should
return internal keys including value type.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7178

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D22935879

Pulled By: riversand963

fbshipit-source-id: 7508d962cf11ebcfa6386d2529b4f3606b47ccfd
2020-08-04 18:52:08 -07:00
Akanksha Mahajan
493f425e77 Add support to start and end IOTracing through DB APIs (#7203)
Summary:
1. Add support to start io tracing through DB::StartIOTrace(Env*, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open.

User side code:

//Open DB
DB::Open(options, dbname, &db);

/* Start tracing */
db->StartIOTrace(env, trace_opt, std::move(trace_writer));

/* Perform Operations */

/*End tracing*/
db->EndIOTrace();

2. Fix the build errors for Windows.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D22901947

Pulled By: akankshamahajan15

fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c
2020-08-04 18:41:45 -07:00
Andrew Kryczka
a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
Aaron Kabcenell
56ed601df3 Compaction Read/Write Stats by Compaction Type (#7165)
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165

Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:

/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1

Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.

Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking

Reviewed By: ajkr

Differential Revision: D22693050

Pulled By: akabcenell

fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
2020-07-29 13:39:29 -07:00
codingsh
50f206ad84 feat: export SetBackgroundThreads(n, Env::BOTTOM); (#7191)
Summary:
- https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7191

Reviewed By: riversand963

Differential Revision: D22809066

Pulled By: ajkr

fbshipit-source-id: 036939f9a28cacc3f677c318d1aed97fe5f4f85e
2020-07-29 12:24:13 -07:00
sdong
692f6a3138 Implement NextAndGetResult() in memtable and level iterator (#7179)
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179

Test Plan: Run normal crash test for a while.

Reviewed By: anand1976

Differential Revision: D22783840

fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
2020-07-29 09:45:21 -07:00
mrambacher
d9d190742c Make env*_test work with ASSERT_STATUS_CHECKED (#7176)
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.

One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176

Reviewed By: cheng-chang

Differential Revision: D22799278

Pulled By: ajkr

fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
2020-07-28 22:59:48 -07:00
codingsh
83ea266b43 export stats_persist_period_sec (#7168)
Summary:
fixed
 - https://github.com/rust-rocksdb/rust-rocksdb/issues/447
 -  https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7168

Reviewed By: cheng-chang

Differential Revision: D22736013

Pulled By: ajkr

fbshipit-source-id: fdd784aa75d26a367b9108b05ffdd94a2ae117d3
2020-07-28 13:05:34 -07:00
Tomas Kolda
cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Jay Zhuang
b0c5ecd6b3 Make max_subcompactions dynamically changeable (#7159)
Summary:
Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159

Reviewed By: siying

Differential Revision: D22671238

Pulled By: jay-zhuang

fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42
2020-07-22 18:32:52 -07:00
Levi Tamasi
0d04a8434a Sync blob files before closing them (#7160)
Summary:
BlobDB currently syncs each blob file periodically after writing a certain amount of
data (as specified by the configuration option `BlobDBOptions::bytes_per_sync`)
and all open blob files when the base DB's memtables are flushed. With the patch,
in addition to the above, blob files are also synced right before being closed, after
the footer has been written. This will be beneficial for the new integrated blob file
write path as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7160

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22672646

Pulled By: ltamasi

fbshipit-source-id: 62b34263543a7e74abcbb7adf011daa1e699998f
2020-07-22 17:25:20 -07:00
Cheng Chang
96ce0470a7 Clean snapshot dir before taking snapshot (#7156)
Summary:
`DBTest::SnapshotFiles` runs the tests in a `while` loop.
Currently, the snapshot directory is not cleaned up in each loop, so previous snapshot files may remain in the next loop's snapshot.
When I'm working on https://github.com/facebook/rocksdb/pull/7129, when checking the tracked WALs in MANIFEST, I find that this test always fails because it reads some unknown WAL. It turns out that the unknown WAL is left from previous loops.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7156

Test Plan: make db_test && ./db_test --gtest_filters=*SnapshotFiles

Reviewed By: siying

Differential Revision: D22668360

Pulled By: cheng-chang

fbshipit-source-id: 69d4aa3506038ba30e218e8ae966357935a99c6c
2020-07-22 13:54:01 -07:00
mrambacher
d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Haosen Wen
dbc51adbac Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153)
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153

Test Plan: make check.

Reviewed By: riversand963

Differential Revision: D22654136

Pulled By: roghnin

fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
2020-07-22 08:55:02 -07:00
sdong
1cf4731dbb column_family_test: fix a data race related to sleeping task (#7150)
Summary:
TSAN reports warning in one column_family_test:

WARNING: ThreadSanitizer: data race (pid=16352)
  Write of size 8 at 0x7ffcdf042158 by main thread:
    #0 pthread_cond_destroy <null> (column_family_test+0x471f65)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::~CondVar() /home/circleci/project/port/port_posix.cc:101:49 (column_family_test+0x8a627a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask() /home/circleci/project/./test_util/testutil.h:397:7 (column_family_test+0x54b6e2)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::ColumnFamilyTest_FlushCloseWALFiles_Test::TestBody() /home/circleci/project/db/column_family_test.cc:3008:1 (column_family_test+0x54b6e2)
......
  Previous read of size 8 at 0x7ffcdf042158 by thread T2 (mutexes: write M0):
    #0 pthread_cond_broadcast <null> (column_family_test+0x471dd2)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::SignalAll() /home/circleci/project/port/port_posix.cc:139:28 (column_family_test+0x8a651a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::DoSleep() /home/circleci/project/./test_util/testutil.h:412:12 (column_family_test+0x58574b)
......

Likely, SleepingBackgroundTask::DoSleep() started to execute after the main thread has finished everything, cancelled and waited for sleeping tasks to finish. At this time, although DoSlee() will not sleep, but it also accesses the mutex, creating a data race with destructor of the test. Fix this bug by waiting for the sleeping task to start sleeping after it is scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7150

Test Plan: Run these modified tests and make sure it doesn't break.

Reviewed By: riversand963

Differential Revision: D22630716

fbshipit-source-id: cc5781cf69083685de406490438898238bdfc2d3
2020-07-20 14:19:48 -07:00
sdong
9870704420 Fix a minor data race in stats dumping threads initialization (#7151)
Summary:
https://github.com/facebook/rocksdb/pull/7145 creates a minor data race against the stat creation counter. Turn it to atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7151

Test Plan: Run the test.

Reviewed By: ajkr

Differential Revision: D22631014

fbshipit-source-id: c6fb69ac5b9df7139795dacea5ce9fb9fd3278d7
2020-07-20 12:12:43 -07:00
Zhichao Cao
ed4712fe7e Remove time out testing cases in error_handler_fs_test (#7141)
Summary:
Remove the 3 testing cases that cause the time out in linux build by https://github.com/facebook/rocksdb/issues/6765 . Will fix them later.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7141

Test Plan: make asan_check, buck run

Reviewed By: ajkr

Differential Revision: D22593831

Pulled By: zhichao-cao

fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e
2020-07-17 23:27:21 -07:00
Andrew Kryczka
9a83fd21e6 stagger first DumpMallocStats after opening DB (#7145)
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145

Reviewed By: riversand963

Differential Revision: D22593031

Pulled By: ajkr

fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
2020-07-17 16:13:26 -07:00
sdong
ca5a069a79 Suppress a TSAN warning (#7126)
Summary:
TSAN shows warning with clang with warning similar to this:

WARNING: ThreadSanitizer: data race (pid=10159)
  Atomic write of size 8 at 0x7b5000002890 by thread T33:
    #0 __tsan_atomic64_store <null> (db_test+0x4ca2b5)
    https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde)
......
  Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832):
    #0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void*) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd)

It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot().

While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI.

In the gcc runs we have been running, this warning rarely shows up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126

Test Plan: See the mini-TSAN test to pass with reasonable run time.

Reviewed By: ajkr

Differential Revision: D22552375

fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00
2020-07-15 13:25:14 -07:00
Zhichao Cao
a10f12eda1 Auto resume the DB from Retryable IO Error (#6765)
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765

Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Reviewed By: anand1976

Differential Revision: D21916789

Pulled By: zhichao-cao

fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
2020-07-15 11:03:58 -07:00
Yanqin Jin
27735dea9a Report corrupted keys during compaction (#7124)
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.

Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22530722

Pulled By: riversand963

fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
2020-07-14 17:18:17 -07:00
Levi Tamasi
bdf4de6cb9 Remove some dead code from BlobLogWriter (#7125)
Summary:
Periodic syncing of blob files is performed by `WritableFileWriter`;
`bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are
actually unused (or more precisely, only used by methods that are
themselves unused). The patch removes all this dead code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22531021

Pulled By: ltamasi

fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10
2020-07-14 13:51:54 -07:00
Yanqin Jin
c628fae6d1 Report corruption on unrecognized value type (#7121)
Summary:
During memtable lookup, an unrecognized value type should be reported as
Status::Corruption.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7121

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D22512124

Pulled By: riversand963

fbshipit-source-id: 9b97be7d9b230c5aae9205f96054420e5ea09066
2020-07-13 20:26:58 -07:00
Stanislav Tkach
393e486e3e Add getters for options to the C API (#7094)
Summary:
Along with https://github.com/facebook/rocksdb/issues/6925 and https://github.com/facebook/rocksdb/issues/6998, this should add getters for all Options fields except several ones with non-trivial interface (for example rocksdb_options_set_min_level_to_compress).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7094

Reviewed By: riversand963

Differential Revision: D22479800

Pulled By: pdillinger

fbshipit-source-id: d14f305e12cfe268d07e0fe229d55cef299c792a
2020-07-10 14:30:04 -07:00
wenh
4924a506b9 Reduce env_->GetChildren() calls in DBImpl::Recover() (#7044)
Summary:
There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems.
This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044

Test Plan:
Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching.

(WIP)

Reviewed By: riversand963

Differential Revision: D22419925

Pulled By: roghnin

fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886
2020-07-10 13:41:08 -07:00
mrambacher
c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Yanqin Jin
f70ad03137 Parameterize a few tests in DBWALTest (#7105)
Summary:
As title. The goal is to shorten the execution time of several tests
when they are combined together in a single TEST_F.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7105

Test Plan:
make db_wal_test
./db_wal_test

Reviewed By: ltamasi

Differential Revision: D22442705

Pulled By: riversand963

fbshipit-source-id: 0ad49b8f21fa86dcd5a4d3c9a06af313735ac217
2020-07-09 11:31:06 -07:00
Akanksha Mahajan
54f171fe90 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096)
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.

Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_  is set to true so that subsequent partitions
can also use internal key mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096

Test Plan: make check -j64

Reviewed By: ajkr

Differential Revision: D22416598

Pulled By: akankshamahajan15

fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
2020-07-08 21:03:04 -07:00
rafael-aero
712458fc34 Add RestoreDBFromLatestBackup to C API, add new C# package (#7092)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7092

Reviewed By: riversand963

Differential Revision: D22412323

Pulled By: ajkr

fbshipit-source-id: 3fc1c63bb19a8cd2c0ae620800c28f199a7f494b
2020-07-08 11:56:41 -07:00
rockeet
b649d8cb97 Fixed Factory construct just for calling .Name() (#7080)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7080

Reviewed By: riversand963

Differential Revision: D22412352

Pulled By: ajkr

fbshipit-source-id: 1d7f4c1621040a0130245139b52c3f4d3deac865
2020-07-08 11:54:00 -07:00
wenh
226d1f9c73 extend listener callback functions to more file I/O operations (#7055)
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055

Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.

Reviewed By: riversand963

Differential Revision: D22380624

Pulled By: roghnin

fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
2020-07-07 18:21:18 -07:00
Andrew Kryczka
dd29ad4223 Separate internal and user key comparators in BlockIter (#6944)
Summary:
Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:

```
$ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false
```

results: perf improved marginally for ingestion_db and did not change significantly for normal_db:

SEEK_NEXT | DB | code | ops/sec | % change
-- | -- | -- | -- | --
0 | normal_db | master | 350880 |  
0 | normal_db | PR6944 | 351040 | 0.0
0 | ingestion_db | master | 343255 |  
0 | ingestion_db | PR6944 | 349424 | 1.8
10 | normal_db | master | 218711 |  
10 | normal_db | PR6944 | 217892 | -0.4
10 | ingestion_db | master | 220334 |  
10 | ingestion_db | PR6944 | 226437 | 2.8

Reviewed By: pdillinger

Differential Revision: D21924676

Pulled By: ajkr

fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538
2020-07-07 17:26:16 -07:00
Levi Tamasi
a693341604 Move the blob file format related classes to the main namespace, rename reader/writer (#7086)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7086

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D22395420

Pulled By: ltamasi

fbshipit-source-id: 088a20097bd6b73b0c433cd79725779f97ec04f2
2020-07-06 17:18:14 -07:00
Peter Dillinger
4b107ceb7e Improve code comments in EstimateLiveDataSize (#7072)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7072

Reviewed By: ajkr

Differential Revision: D22391641

Pulled By: pdillinger

fbshipit-source-id: 0ef355576454514263ab684eb1a5c06787f3242a
2020-07-06 16:17:02 -07:00
Jay Zhuang
00de699096 Replace reinterpret_cast with static_cast_with_check (#7067)
Summary:
Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067

Reviewed By: siying

Differential Revision: D22361587

Pulled By: jay-zhuang

fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1
2020-07-02 19:25:41 -07:00
Zitan Chen
373d5ac485 BackupEngine verifies table file checksums on creating new backups (#7015)
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.

After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.

No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.

In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.

Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22165732

Pulled By: gg814

fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
2020-07-02 18:15:12 -07:00
Peter Dillinger
a680a7ea37 Un-revert #7049, revert #7022 (#7071)
Summary:
Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071

Reviewed By: akankshamahajan15

Differential Revision: D22362857

Pulled By: pdillinger

fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0
2020-07-02 13:30:41 -07:00
Peter Dillinger
52d59e0c93 Revert "Whole DBTest to skip fsync (#7049)" (#7070)
Summary:
This reverts commit 4f1534bdb0.

This commit caused failures and deadlocks in
MultiThreadedDBTest.MultiThreaded/69 and others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070

Reviewed By: riversand963

Differential Revision: D22358778

Pulled By: pdillinger

fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac
2020-07-02 10:22:43 -07:00
sdong
4f1534bdb0 Whole DBTest to skip fsync (#7049)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D22301700

fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
2020-07-01 19:37:56 -07:00
Akanksha Mahajan
5edfe3a3d8 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022)
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.

Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022

Test Plan:
1. make check -j64
           2. Added one unit test case

Reviewed By: ajkr

Differential Revision: D22197734

Pulled By: akankshamahajan15

fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
2020-07-01 14:58:08 -07:00
Andrew Kryczka
c25a014792 deflake DBCompactionTestWithParam.IntraL0Compaction test (#7065)
Summary:
This check is flaky because compaction could run between the `Flush()` and the `TestGetTickerCount()`, which would increase the `BLOCK_CACHE_INDEX_MISS` count beyond what the test expects. Verified by adding a `sleep(1)` between those two lines and observing the counter is too high every time. The solution is just to remove this check as it doesn't have any use anyways. The latter check of index miss is sufficient to conclude the newest L0 file (i.e., the one generated by intra-L0) does not have its index block pinned in cache. It'd be nice to simultaneously check the L0 files generated by flush do have their index blocks pinned in cache, but that's not what the line deleted in this PR was checking..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7065

Reviewed By: pdillinger

Differential Revision: D22340327

Pulled By: ajkr

fbshipit-source-id: e076b2c7228b7fa763dd0c0cb13828e176c1abee
2020-07-01 14:53:10 -07:00
Peter Dillinger
e2fd501d44 Stabilize DBTest.ApproximateSizesMemTable (#7064)
Summary:
Random memtable layouts could cause random failure,
reproducible with command below running for a while. Test now using
deterministic behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7064

Test Plan: while ./db_test --gtest_filter=*SizesMemTable*; do true; done

Reviewed By: siying

Differential Revision: D22339442

Pulled By: pdillinger

fbshipit-source-id: 8e74e5a9b5e88f7030854045a22c12cf561d5de6
2020-07-01 13:52:20 -07:00
mrambacher
80f71b5863 Use Libraries in the RocksDB Makefile Build (#6660)
Summary:
Change the linking of tests/tools to be against a library rather than a list of objects.  This change substantially reduces the size of the objects produced.

peterd clean repo size: 264M
Before this change, with make all: 40G
After this change, with make all: 28G
With make LIB_MODE=shared all: 7.0G

The list of TESTS was changed from being hard-coded to generated from the test sources variable.  Note that there are some test sources that are not built as tests (though the set of tests is identical to the previous version).

Added OBJ_DIR option to Makefile to allow objects to be placed in an alternative location.  By default, OBJ_DIR is the same as before ("./").

This change is a precursor to being able to build/run the tests/tools linked against static libraries.  Additionally, it should be possible to clean up and merge some of the rules for building tests and the like if so desired.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6660

Reviewed By: riversand963

Differential Revision: D22244463

Pulled By: pdillinger

fbshipit-source-id: db9c6341d81ed62c2270374f4ede02fb9604c754
2020-06-30 19:33:31 -07:00
Levi Tamasi
e367bc7f4b Clean up blob files based on the linked SST set (#7001)
Summary:
The earlier `VersionBuilder` code only cleaned up blob files that were
marked as entirely consisting of garbage using `VersionEdits` with
`BlobFileGarbage`. This covers the cases when table files go through
regular compaction, where we iterate through the KVs and thus have an
opportunity to calculate the amount of garbage (that is, most cases).
However, it does not help when table files are simply dropped (e.g. deletion
compactions or the `DeleteFile` API). To deal with such cases, the patch
adds logic that cleans up all blob files at the head of the list until the first
one with linked SSTs is found. (As an example, let's assume we have blob files
with numbers 1..10, and the first one with any linked SSTs is number 8.
This means that SSTs in the `Version` only rely on blob files with numbers >= 8,
and thus 1..7 are no longer needed.)

The code change itself is pretty small; however, changing the logic like this
necessitated changes to some tests that have been added recently (namely
to the ones that use blob files in isolation, i.e. without any table files referring
to them). Some of these cases were fixed by bypassing `VersionBuilder` altogether
in order to keep the tests simple (which actually makes them more proper unit tests
as well), while the `VersionBuilder` unit tests were fixed by adding dummy table
files to the test cases as needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7001

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22119474

Pulled By: ltamasi

fbshipit-source-id: c6547141355667d4291d9661d6518eb741e7b54a
2020-06-30 15:31:21 -07:00
sdong
80b107a0a9 Divide WriteCallbackTest.WriteWithCallbackTest (#7037)
Summary:
WriteCallbackTest.WriteWithCallbackTest has a deep for-loop and in some cases runs very long. Parameterimized it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7037

Test Plan: Run the test and see it passes.

Reviewed By: ltamasi

Differential Revision: D22269259

fbshipit-source-id: a1b6687b5bf4609754833d14cf383d68bc7ab27a
2020-06-30 12:31:30 -07:00
Burton Li
5be2cb6948 Compaction filter support for BlobDB (#6850)
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850

Reviewed By: siying

Differential Revision: D22263487

Pulled By: ltamasi

fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
2020-06-29 17:32:14 -07:00
sdong
58547e533b Disable fsync in some tests to speed them up (#7036)
Summary:
Fsyncing files is not providing more test coverage in many tests. Provide an option in SpecialEnv to turn it off to speed it up and enable this option in some tests with relatively long run time.
Most of those tests can be divided as parameterized gtest too. This two speed up approaches are orthogonal and we can do both if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7036

Test Plan: Run all tests and make sure they pass.

Reviewed By: ltamasi

Differential Revision: D22268084

fbshipit-source-id: 6d4a838a1b7328c13931a2a5d93de57aa02afaab
2020-06-29 16:56:59 -07:00
Anand Ananthabhotla
9a5886bd8c Extend Get/MultiGet deadline support to table open (#6982)
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982

Test Plan: Add new unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22219515

Pulled By: anand1976

fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
2020-06-29 14:53:17 -07:00
Stanislav Tkach
1b85d57cf5 Expose KeyMayExist in the C API (#7021)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7021

Reviewed By: ajkr

Differential Revision: D22246297

Pulled By: pdillinger

fbshipit-source-id: 81dfd0a49e4d5ce0c9f00772c17cca425757ea24
2020-06-29 12:21:53 -07:00
Yanqin Jin
d47c871190 Fix data race to VersionSet::io_status_ (#7034)
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.

We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:

1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
   failure without process restart, e.g. calling Resume().

Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034

Reviewed By: zhichao-cao

Differential Revision: D22247137

Pulled By: riversand963

fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
2020-06-27 08:57:31 -07:00
sdong
f9817201af Add unity build to CircleCI (#7026)
Summary:
We are still keeping unity build working. So it's a good idea to add to a pre-commit CI.
A latest GCC docker image just to get a little bit more coverage. Fix three small issues to make it pass.
Also make unity_test to run db_basic_test rather than db_test to cut the test time. There is no point to run expensive tests here. It was set to run db_test before db_basic_test was separated out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026

Test Plan: watch tests to pass.

Reviewed By: zhichao-cao

Differential Revision: D22223197

fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4
2020-06-26 11:14:08 -07:00
Stanislav Tkach
70b5d95dc7 Add (more) getters for options to the C API (#6998)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6998

Reviewed By: ajkr

Differential Revision: D22211700

Pulled By: pdillinger

fbshipit-source-id: 1141c20527dee5e13205059bf8e83927063c4c1e
2020-06-25 13:53:33 -07:00
sdong
d64cf0e4ee Move away from direct TmpDir() call in some tests (#7030)
Summary:
Some tests directly uses TmpDir() as temporary directory without adding any randomize factor. This would cause failures when tests run in parallel. Fix it by moving some of them to test::PerThreadDBPath()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7030

Test Plan: Watch existing tests pass

Reviewed By: zhichao-cao

Differential Revision: D22224710

fbshipit-source-id: 28c9932fede0a4a64670e5b5fdb08f4fb5dccdd0
2020-06-25 12:09:57 -07:00
Zitan Chen
be41c61f22 Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997)
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.

Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.

Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997

Test Plan: Passed make check.

Reviewed By: ajkr

Differential Revision: D22098895

Pulled By: gg814

fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
2020-06-24 19:31:25 -07:00
Yanqin Jin
e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
sdong
9cc25190e1 Test CircleCI with CLANG-10 (#7025)
Summary:
It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025

Test Plan: See all tests pass.

Reviewed By: pdillinger

Differential Revision: D22215700

fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
2020-06-24 16:22:49 -07:00
Levi Tamasi
9f21d08660 Move kNoExpiration to blob_db.h (#7018)
Summary:
The constant `kNoExpiration` is currently defined in an
internal/implementation header (`blob_log_format.h`); the patch moves it
to the public header `blob_db.h` so it is accessible to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7018

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22191354

Pulled By: ltamasi

fbshipit-source-id: 98c8012a83b999a3f1a30e955ce6bb71ba29dc5c
2020-06-23 13:45:06 -07:00
sdong
dea4063b13 Remove an assertion in FlushAfterIntraL0CompactionCheckConsistencyFail (#7003)
Summary:
FlushAfterIntraL0CompactionCheckConsistencyFail is flakey. It sometimes fails with:

db/db_compaction_test.cc:5186: Failure
Expected equality of these values:
  10
  NumTableFilesAtLevel(0)
    Which is: 3

I don't see a clear reason why the assertion would always be true. The necessarily of the assertion is not clear either. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7003

Test Plan: See the test still builds.

Reviewed By: riversand963

Differential Revision: D22129753

fbshipit-source-id: 42f0bb05e32b369e8d726bfd3e35c29cf52fe008
2020-06-19 16:58:29 -07:00
Peter Dillinger
25a0d0ca30 Fix block checksum for >=4GB, refactor (#6978)
Summary:
Although RocksDB falls over in various other ways with KVs
around 4GB or more, this change fixes how XXH32 and XXH64 were being
called by the block checksum code to support >= 4GB in case that should
ever happen, or the code copied for other uses.

This change is not a schema compatibility issue because the checksum
verification code would checksum the first (block_size + 1) mod 2^32
bytes while the checksum construction code would checksum the first
block_size mod 2^32 plus the compression type byte, meaning the
XXH32/64 checksums for >=4GB block would not match about 255/256 times.

While touching this code, I refactored to consolidate redundant
implementations, improving diagnostics and performance tracking in some
cases. Also used less confusing language in those diagnostics.

Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978

Test Plan:
I was able to write a test for this using an SST file writer
and VerifyChecksum in a reader. The test fails before the fix, though
I'm leaving the test disabled because I don't think it's worth the
expense of running regularly.

Reviewed By: gg814

Differential Revision: D22143260

Pulled By: pdillinger

fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
2020-06-19 16:18:24 -07:00
Peter Dillinger
88b4210701 Remove racially charged terms "whitelist" and "blacklist" (#7008)
Summary:
We don't need them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7008

Test Plan: "make check" and ensure "make crash_test" starts

Reviewed By: ajkr

Differential Revision: D22143838

Pulled By: pdillinger

fbshipit-source-id: 72c8e16603abc59f4954e304466bc4dc1f58f94e
2020-06-19 15:27:32 -07:00
sdong
d6b7b7712f Fix a bug that causes iterator to return wrong result in a rare data race (#6973)
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973

Test Plan: Will run stress tests for a while to make sure there is no general regression.

Reviewed By: ajkr

Differential Revision: D22029348

fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
2020-06-18 10:16:38 -07:00
Yanqin Jin
569b87e8c7 Fail recovery when MANIFEST record checksum mismatch (#6996)
Summary:
https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows.
Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further.
```
// Correct
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
while (reader.ReadRecord() && s.ok()) {
...
}
```
The bug is that, the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` retries internally read and finds the next valid record. The mismatched record will be ignored and no error is reported.
```
// Incorrect
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
s = ReadAndRecover(reader, ...);

// Inside ReadAndRecover
  Status s;  // Shadows the s in Recover.
  while (reader.ReadRecord() && s.ok()) {
   ...
  }
```
`LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996

Reviewed By: ajkr

Differential Revision: D22105746

Pulled By: riversand963

fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38
2020-06-18 10:09:12 -07:00
sdong
223b57eeb8 Fix the bug that compressed cache is disabled in read-only DBs (#6990)
Summary:
Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990

Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.

Reviewed By: ltamasi

Differential Revision: D22072755

fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
2020-06-17 14:30:54 -07:00
Zitan Chen
94d04529de Store DB identity and DB session ID in SST files (#6983)
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.

The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.

In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.

A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983

Test Plan: make check and some manual tests.

Reviewed By: zhichao-cao

Differential Revision: D22048826

Pulled By: gg814

fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
2020-06-17 10:57:40 -07:00
Yanqin Jin
b7bab48099 Fix a bug of overwriting return code (#6989)
Summary:
In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6989

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22071418

Pulled By: riversand963

fbshipit-source-id: 5a4ea5dfb1a41f41c7a3fdaf62b163007b42f04b
2020-06-16 12:59:35 -07:00
Yanqin Jin
9bfd46d0d8 Let best-efforts recovery ignore CURRENT file (#6970)
Summary:
Best-efforts recovery does not check the content of CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on CURRENT file at all.

Test plan (dev server):
make check
./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970

Reviewed By: anand1976

Differential Revision: D22013990

Pulled By: riversand963

fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f
2020-06-15 14:11:24 -07:00
Zitan Chen
88db97b06d Add a DB Session ID (#6959)
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959

Test Plan: Passed make check

Reviewed By: zhichao-cao

Differential Revision: D21951721

Pulled By: gg814

fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
2020-06-15 10:47:02 -07:00
Zhen Li
9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Levi Tamasi
83833637c1 Maintain the set of linked SSTs in BlobFileMetaData (#6945)
Summary:
The `FileMetaData` objects associated with table files already contain the
number of the oldest blob file referenced by the SST in question. This patch
adds the inverse mapping to `BlobFileMetaData`, namely the set of table file
numbers for which the oldest blob file link points to the given blob file (these
are referred to as *linked SSTs*). This mapping will be used by the GC logic.

Implementation-wise, the patch builds on the `BlobFileMetaDataDelta`
functionality introduced in https://github.com/facebook/rocksdb/pull/6835: newly linked/unlinked SSTs are
accumulated in `BlobFileMetaDataDelta`, and the changes to the linked SST set
are applied in one shot when the new `Version` is saved. The patch also reworks
the blob file related consistency checks in `VersionBuilder` so they validate the
consistency of the forward table file -> blob file links and the backward blob file ->
table file links for blob files that are part of the `Version`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6945

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21912228

Pulled By: ltamasi

fbshipit-source-id: c5bc7acf6e729a8fccbb12672dd5cd00f6f000f8
2020-06-12 09:54:39 -07:00
Yanqin Jin
717749f4c0 Fail point-in-time WAL recovery upon IOError reading WAL (#6963)
Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963

Reviewed By: ajkr

Differential Revision: D22011083

Pulled By: riversand963

fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
2020-06-11 18:42:10 -07:00
Levi Tamasi
d854abad78 Revisit the handling of the case when a file is re-added to the same level (#6939)
Summary:
https://github.com/facebook/rocksdb/pull/6901 subtly changed the handling of the corner case
when a table file is deleted from a level, then re-added to the same level. (Note: this
should be extremely rare; one scenario that comes to mind is a trivial move followed by
a call to `ReFitLevel` that moves the file back to the original level.) Before that change,
a new `FileMetaData` object was created as a result of this sequence; after the change,
the original `FileMetaData` was essentially resurrected (since the deletion and the addition
simply cancel each other out with the change). This patch restores the original behavior,
which is more intuitive considering the interface, and in sync with how trivial moves are handled.
(Also note that `FileMetaData` contains some mutable data members, the values of which
might be different in the resurrected object and the freshly created one.)
The PR also fixes a bug in this area: with the original pre-6901 code, `VersionBuilder`
would add the same file twice to the same level in the scenario described above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6939

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D21905580

Pulled By: ltamasi

fbshipit-source-id: da07ae45384ecf3c6c53506d106432d88a7ec9df
2020-06-11 18:32:18 -07:00
Levi Tamasi
722ebba834 Turn DBTest2.CompressionFailures into a parameterized test (#6968)
Summary:
`DBTest2.CompressionFailures` currently tests many configurations
sequentially using nested loops, which often leads to timeouts
in our test system. The patch turns it into a parameterized test
instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D22006954

Pulled By: ltamasi

fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c
2020-06-11 16:35:39 -07:00
Zhichao Cao
b3585a11b4 Ingest SST files with checksum information (#6891)
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.

    1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
    2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891

Test Plan: added unit test, pass make asan_check

Reviewed By: pdillinger

Differential Revision: D21935988

Pulled By: zhichao-cao

fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
2020-06-11 14:27:36 -07:00
Levi Tamasi
fbe2d259cb Use a per-thread path for the export directory in import_column_family_test (#6962)
Summary:
This is required so that the test cases can safely be run in parallel.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6962

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D21980060

Pulled By: ltamasi

fbshipit-source-id: 616b7a0b686155d3874848b9098c67ad3f47efcc
2020-06-10 14:04:07 -07:00
Andrew Kryczka
02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
Levi Tamasi
5abda3bb8b Move blob_log_{format,reader,writer}.{cc,h} to db/blob/ (#6960)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6960

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D21958416

Pulled By: ltamasi

fbshipit-source-id: 97cf05027b7363014b07836e7f158c23827bd661
2020-06-09 15:16:05 -07:00
anand76
1fb3593f25 Fix a bug in looking up duplicate keys with MultiGet (#6953)
Summary:
When MultiGet is called with duplicate keys, and the key matches the
largest key in an SST file and the value type is merge, only the first
instance of the duplicate key is returned with correct results. This is
due to the incorrect assumption that if a key in a batch is equal to the
largest key in the file, the next key cannot be present in that file.

Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953

Reviewed By: cheng-chang

Differential Revision: D21935898

Pulled By: anand1976

fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
2020-06-08 16:11:21 -07:00
Levi Tamasi
f5e649453c Add convenience method GetFileMetaDataByNumber (#6940)
Summary:
The patch adds a convenience method `GetFileMetaDataByNumber` that
builds on the `FileLocation` functionality introduced recently (see
https://github.com/facebook/rocksdb/pull/6862). This method makes it possible to
retrieve the `FileMetaData` directly as opposed to having to go through
`LevelFiles` and friends.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6940

Test Plan: `make check`

Reviewed By: cheng-chang

Differential Revision: D21905946

Pulled By: ltamasi

fbshipit-source-id: af99e19de21242b2b4a87594a535c6028d16ee72
2020-06-08 16:01:59 -07:00
Yanqin Jin
3020df9df5 Remove unnecessary inclusion of version_edit.h in env (#6952)
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952

Reviewed By: zhichao-cao

Differential Revision: D21925341

Pulled By: riversand963

fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
2020-06-07 21:56:55 -07:00
Levi Tamasi
f8c2e5a608 Do not print messages to stderr in VersionBuilder (#6948)
Summary:
RocksDB is an embedded library; we should not write to the application's
console. Note: in each case, the same information is returned in the form of a
`Status::Corruption` object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6948

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D21914965

Pulled By: ltamasi

fbshipit-source-id: ae4b66789aa6b659eb8cc2ed4a048187962c86cc
2020-06-05 20:10:30 -07:00
Levi Tamasi
8988f831d8 Fix up a VersionBuilder test case (#6942)
Summary:
We currently do not have any validation that would ensure that the `FileMetaData`
objects are equivalent when a file gets deleted from the LSM tree and then re-added
(think trivial moves); however, if we did, this test case would be in violation. The patch
changes the values used in the test case so they are consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6942

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D21911366

Pulled By: ltamasi

fbshipit-source-id: 2f0486f8337373a6a111b6f28433d70507857104
2020-06-05 18:16:38 -07:00
Zitan Chen
23e446a157 Disable OpenForReadOnly tests in the LITE mode (#6947)
Summary:
Disable two OpenForReadOnly tests in the LITE mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6947

Test Plan: passed db_test2

Reviewed By: cheng-chang

Differential Revision: D21914345

Pulled By: gg814

fbshipit-source-id: 58e81baf5d8cf8adcedaef3966aa3a427bbdf7c2
2020-06-05 17:28:24 -07:00
anand76
98b0cbea88 Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909)
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.

Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909

Reviewed By: pdillinger

Differential Revision: D21833922

Pulled By: anand1976

fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
2020-06-05 11:08:25 -07:00
Akanksha Mahajan
2677bd5967 Add logs and stats in DeleteScheduler (#6927)
Summary:
Add logs and stats for files marked as trash and files deleted immediately in DeleteScheduler
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6927

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D21869068

Pulled By: akankshamahajan15

fbshipit-source-id: e9f673c4fa8049ce648b23c75d742f2f9c6c57a1
2020-06-05 09:43:04 -07:00
Peter Dillinger
aaece2a98d Fix some defects reported by Coverity Scan (#6933)
Summary:
Confusing checks for null that are never null
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6933

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D21885466

Pulled By: pdillinger

fbshipit-source-id: 4b48e03c2a33727f2702b0d12292f9fda5a3c475
2020-06-04 15:46:27 -07:00
Yanqin Jin
2f3261831b Fix a typo (bug) when setting error during Flush (#6928)
Summary:
As title. The prior change to the line is a typo. Fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D21873587

Pulled By: riversand963

fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430
2020-06-04 08:30:42 -07:00
Zitan Chen
02df00d97b API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900)
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.

This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.

Two tests for this updated behavior of DB::OpenForReadOnly are also added.

Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900

Test Plan: passed make check; also manually tested a few ldb subcommands

Reviewed By: pdillinger

Differential Revision: D21822188

Pulled By: gg814

fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
2020-06-03 18:57:49 -07:00
Stanislav Tkach
b7c825d5cf Add (some) getters for options to the C API (#6925)
Summary:
Additionally I have extended the incomplete test added in the https://github.com/facebook/rocksdb/issues/6880.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6925

Reviewed By: ajkr

Differential Revision: D21869788

Pulled By: pdillinger

fbshipit-source-id: e9db80f259c57ca1bdcbc2c66cb938cb1ac26e48
2020-06-03 17:08:50 -07:00
sdong
afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Hao Chen
ffe08ffcc2 correct level information in version_set.cc (#6920)
Summary:
fix these two issues https://github.com/facebook/rocksdb/issues/6912  and https://github.com/facebook/rocksdb/issues/6667
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6920

Reviewed By: cheng-chang

Differential Revision: D21864885

Pulled By: ajkr

fbshipit-source-id: 10e21fc1851b67a59d44358f59c64fa5523bd263
2020-06-03 12:27:13 -07:00
Anatoly Zhmur
22e5c513c2 Add zstd_max_train_bytes to c interop (#6796)
Summary:
Added setting of zstd_max_train_bytes compression option parameter to c interop.

rocksdb_options_set_bottommost_compression_options was using bool parameter and thus not exported, updated it to unsigned char and added to c.h as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6796

Reviewed By: cheng-chang

Differential Revision: D21611471

Pulled By: ajkr

fbshipit-source-id: caaaf153de934837ad9af283c7f8c025ff0b0cf5
2020-06-03 12:27:12 -07:00
Peter Dillinger
43f8a9dcce Some fixes for gcc 4.8 and add to Travis (#6915)
Summary:
People keep breaking the gcc 4.8 compilation due to different
warnings for shadowing member functions with locals. Adding to Travis
to keep compatibility. (gcc 4.8 is default on CentOS 7.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6915

Test Plan: local and Travis

Reviewed By: siying

Differential Revision: D21842894

Pulled By: pdillinger

fbshipit-source-id: bdcd4385127ee5d1cc222d87e53fb3695c32a9d4
2020-06-03 11:39:25 -07:00
Levi Tamasi
78e291b17e Improve consistency checks in VersionBuilder (#6901)
Summary:
The patch cleans up the code and improves the consistency checks around
adding/deleting table files in `VersionBuilder`. Namely, it makes the checks
stricter and improves them in the following ways:
1) A table file can now only be deleted from the LSM tree using the level it
resides on. Earlier, there was some unnecessary wiggle room for
trivially moved files (they could be deleted using a lower level number than
the actual one).
2) A table file cannot be added to the tree if it is already present in the tree
on any level (not just the target level). The earlier code only had an assertion
(which is a no-op in release builds) that the newly added file is not already
present on the target level.
3) The above consistency checks around state transitions are now mandatory,
as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op
in release mode unless `force_consistency_checks` was set to `true`. The rationale
here is that assuming that the initial state is consistent, a valid transition leads to a
next state that is also consistent; however, an *invalid* transition offers no such
guarantee. Hence it makes sense to validate the transitions unconditionally,
and save `force_consistency_checks` for the paranoid checks that re-validate
the entire state.
4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862,
which enables us to efficiently look up the location (level and position within level)
of files in a `Version` by file number. This makes the consistency checks much more
efficient than the earlier `CheckConsistencyForDeletes`, which essentially
performed a linear search.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901

Test Plan:
Extended the unit tests and ran:

`make check`
`make whitebox_crash_test`

Reviewed By: ajkr

Differential Revision: D21822714

Pulled By: ltamasi

fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
2020-06-03 11:24:55 -07:00
Peter Dillinger
9360776cb9 Fix handling of too-small filter partition size (#6905)
Summary:
Because ARM and some other platforms have a larger cache line
size, they have a larger minimum filter size, which causes recently
added PartitionedMultiGet test in db_bloom_filter_test to fail on those
platforms. The code would actually end up using larger partitions,
because keys_per_partition_ would be 0 and never == number of keys
added.

The code now attempts to get as close as possible to the small target
size, while fully utilizing that filter size, if the target partition
size is smaller than the minimum filter size.

Also updated the test to break more uniformly across platforms
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905

Test Plan: updated test, tested on ARM

Reviewed By: anand1976

Differential Revision: D21840639

Pulled By: pdillinger

fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
2020-06-03 10:43:01 -07:00
Zhichao Cao
2adb7e3768 Fix potential overflow of unsigned type in for loop (#6902)
Summary:
x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902

Test Plan: pass make asan_check

Reviewed By: ajkr

Differential Revision: D21843767

Pulled By: zhichao-cao

fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
2020-06-02 15:05:07 -07:00
Stanislav Tkach
38f988d3b4 Expose rocksdb_options_copy function to the C API (#6880)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6880

Reviewed By: ajkr

Differential Revision: D21842752

Pulled By: pdillinger

fbshipit-source-id: eda326f551ddd9cb397681544b9e9799ea614e52
2020-06-02 13:48:51 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Adam Retter
8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
Andrew Kryczka
c5abf78bca avoid IterKey::UpdateInternalKey() in BlockIter (#6843)
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```

results:

| DB | code | throughput |
|---|---|---|
| normal_db | master |  267.9 |
| normal_db   |    PR6843 | 254.2 (-5.1%) |
| ingestion_db |   master |  259.6 |
| ingestion_db |   PR6843 | 250.5 (-3.5%) |

Reviewed By: pdillinger

Differential Revision: D21562604

Pulled By: ajkr

fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
2020-05-28 10:51:30 -07:00
Yanqin Jin
961c7590d6 Add timestamp to delete (#6253)
Summary:
Preliminary user-timestamp support for delete.

If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.

Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.

Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253

Reviewed By: ltamasi

Differential Revision: D20995328

Pulled By: riversand963

fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
2020-05-28 10:40:03 -07:00
Levi Tamasi
e3f953a863 Make it possible to look up files by number in VersionStorageInfo (#6862)
Summary:
Does what it says on the can: the patch adds a hash map to `VersionStorageInfo`
that maps file numbers to file locations, i.e. (level, position in level) pairs. This
will enable stricter consistency checks in `VersionBuilder`. The patch also fixes
all the unit tests that used duplicate file numbers in a version (which would trigger
an assertion with the new code).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6862

Test Plan:
`make check`
`make whitebox_crash_test`

Reviewed By: riversand963

Differential Revision: D21670446

Pulled By: ltamasi

fbshipit-source-id: 2eac249945cf33d8fb8597b26bfff5221e1a861a
2020-05-28 10:03:06 -07:00
Akanksha Mahajan
bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
Zhichao Cao
545e14b53b Generate file checksum in SstFileWriter (#6859)
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.

This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the  ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859

Test Plan: add unit test and pass make asan_check.

Reviewed By: ajkr

Differential Revision: D21656247

Pulled By: zhichao-cao

fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
2020-05-20 11:55:31 -07:00
Levi Tamasi
1551f1011a Refactor the blob file related logic in VersionBuilder (#6835)
Summary:
This patch is groundwork for an upcoming change to store the set of
linked SSTs in `BlobFileMetaData`. With the current code, a new
`BlobFileMetaData` object is created each time a `VersionEdit` touches
a certain blob file. This is fine as long as these objects are lightweight
and cheap to create; however, with the addition of the linked SST set, it would
be very inefficient since the set would have to be copied over and over again.
Note that this is the same kind of problem that `VersionBuilder` is solving
w/r/t `Version`s and files, and we can apply the same solution; that is, we can
accumulate the changes in a different mutable object, and apply the delta in
one shot when the changes are committed. The patch does exactly that by
adding a new `BlobFileMetaDataDelta` class to `VersionBuilder`. In addition,
it turns the existing `GetBlobFileMetaData` helper into `IsBlobFileInVersion`
(which is fine since that's the only thing the method's clients care about now),
and adds a couple of helper methods that can create a `BlobFileMetaData`
object from the `BlobFileMetaData` in the base (if applicable) and the delta
when the `Version` is saved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6835

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21505187

Pulled By: ltamasi

fbshipit-source-id: d81a48c5f2ca7b79d7124c935332a6bcf3d5d988
2020-05-19 10:00:04 -07:00
Yanqin Jin
b11a8b1b9a Fix valgrind error by init memory region (#6842)
Summary:
As title. After allocating a memory buffer, initialize its content to 0s.
This fixes valgrind issue introduced in https://github.com/facebook/rocksdb/issues/6709.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6842

Test Plan:
```
$make valgrind_test
$valgrind --tool=memcheck --track-origins=yes ./db_test2 --gtest_filter=DBTest2.CompressionFailures
```

Reviewed By: pdillinger

Differential Revision: D21551848

Pulled By: riversand963

fbshipit-source-id: e87a6f413e3f3d92d8e23d8ecc4cf93479c6674c
2020-05-14 18:50:03 -07:00
anand76
50d63a2af0 Fix LITE build failure in compaction_picker_test (#6839)
Summary:
Fix LITE build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6839

Test Plan: LITE=1 make check

Reviewed By: riversand963

Differential Revision: D21535808

Pulled By: anand1976

fbshipit-source-id: fcad961eca08e13fb0c256c92d18c3c1f1165f22
2020-05-13 10:47:15 -07:00
sdong
4a4b8a1344 sst_dump to reduce number of file reads (#6836)
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836

Test Plan: Add a test that covers this new feature.

Reviewed By: pdillinger

Differential Revision: D21516607

fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
2020-05-12 18:23:33 -07:00
Stanislav Tkach
70aaa9ceeb Expose CancellAllBackgroundWork to C api (#6832)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6832

Reviewed By: cheng-chang

Differential Revision: D21498186

Pulled By: riversand963

fbshipit-source-id: 66bb0d7c06af2bf0df3c6a09b61bca2fb81f2dd3
2020-05-12 14:50:52 -07:00
Ziyue Yang
c384c08a4f Add tests for compression failure in BlockBasedTableBuilder (#6709)
Summary:
Currently there is no check for whether BlockBasedTableBuilder will expose
compression error status if compression fails during the table building.
This commit adds fake faulting compressors and a unit test to test such
cases.

This check finds 5 bugs, and this commit also fixes them:

1. Not handling compression failure well in
   BlockBasedTableBuilder::BGWorkWriteRawBlock.
2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
3. Wrongly passing the same reference of block contents to
   BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
   BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
   blocks.
5. Not maintaining variables for compression ratio estimation and first_block
   in BlockBasedTableBuilder::EnterUnbuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709

Reviewed By: ajkr

Differential Revision: D21236254

fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
2020-05-12 09:27:35 -07:00
sdong
a50ea71c00 Improve ldb consistency checks (#6802)
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802

Test Plan: Add a new unit test. Some manual tests with corrupted DBs.

Reviewed By: pdillinger

Differential Revision: D21388051

fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
2020-05-08 14:17:47 -07:00
Yanqin Jin
e72e2167fd Fix a few bugs in best-efforts recovery (#6824)
Summary:
1. Update column_family_memtables_ to point to latest column_family_set in
   version_set after recovery.
2. Normalize file paths passed by application so that directories end with '/'
   or '\\'.
3. In addition to missing files, corrupted files are also ignored in
   best-efforts recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824

Test Plan: COMPILE_WITH_ASAN=1 make check

Reviewed By: anand1976

Differential Revision: D21463905

Pulled By: riversand963

fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
2020-05-08 13:01:42 -07:00
anand76
94265234de Fix race due to delete triggered compaction in Universal compaction mode (#6799)
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.

Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799

Reviewed By: pdillinger

Differential Revision: D21431286

Pulled By: anand1976

fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
2020-05-07 17:32:17 -07:00
Peter Dillinger
b27a1448b6 Fix false NotFound from batched MultiGet with kHashSearch (#6821)
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)

This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821

Test Plan: modified existing unit test to reproduce problem

Reviewed By: anand1976

Differential Revision: D21450469

Pulled By: pdillinger

fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
2020-05-07 15:41:37 -07:00
Andrew Kryczka
e9ba4ba348 validate range tombstone covers positive range (#6788)
Summary:
We found some files containing nothing but negative range tombstones,
and unsurprisingly their metadata specified a negative range, which made
things crash. Time to add a bit of user input validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788

Reviewed By: zhichao-cao

Differential Revision: D21343719

Pulled By: ajkr

fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
2020-05-07 11:55:30 -07:00
Levi Tamasi
ac3ae1df0b Find/purge obsolete blob files (#6807)
Summary:
The patch extends `FindObsoleteFiles` and `PurgeObsoleteFiles` with
support for blob files. The behavior is analogous to SST files: obsolete
blob files are put on the "candidates for deletion" list, while live (and pending)
files are preserved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6807

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21406249

Pulled By: ltamasi

fbshipit-source-id: 1948f71c31927564b61e8af394f50ca3964880d9
2020-05-07 09:32:51 -07:00
sdong
c21c459771 Slightly expand converage to file consistency check failure (#6800)
Summary:
Current DBCompactionTest.ConsistencyFailTest checks DB fails after L0 inconsitency is found. Add slightly more coverage by introducing DBCompactionTest.ConsistencyFailTest2 which checks non-L0 files too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6800

Test Plan: Run the new test.

Reviewed By: riversand963

Differential Revision: D21384806

fbshipit-source-id: 36db7b657eed42115283fe2f6afa4c3a31a3b510
2020-05-05 18:31:53 -07:00
mrambacher
394f2bbd13 Add OptionTypeInfo::Enum and related methods (#6423)
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo.  This change allows enums to be converted/compared without adding a special "type" to the OptionType.

This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423

Reviewed By: siying

Differential Revision: D21408713

Pulled By: zhichao-cao

fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
2020-05-05 15:04:04 -07:00
Andrew Kryczka
a96461d169 fix swallowed error for file deletion consistency check (#6809)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6809

Reviewed By: pdillinger

Differential Revision: D21411971

Pulled By: ajkr

fbshipit-source-id: 900b6b0370b76e9a3e5e03f968e2ac1bbaab73b8
2020-05-05 14:54:21 -07:00
Peter Dillinger
2f1700c8c5 Fix failure to write output in SpecialEnv::GetCurrentTime (#6803)
Summary:
This very old test code bug was causing a new valgrind failure
in MultiGetDeadlineExceeded

Also fix hang in MultiGetDeadlineExceeded by unifying with some logic from another test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6803

Test Plan: run that unit test under valgrind, make check

Reviewed By: siying

Differential Revision: D21388470

Pulled By: pdillinger

fbshipit-source-id: 0ce99d6d5eb8cd3195b17406892c8c5cff5fa5dd
2020-05-05 13:11:29 -07:00
Cheng Chang
91bc0130fa Refactor level compaction picker (#6804)
Summary:
1. refactor out PickFileToCompact to remove duplicated logic.
2. remove redundant checks of `start_level_inputs_.empty()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6804

Test Plan: make check

Reviewed By: siying

Differential Revision: D21390053

Pulled By: cheng-chang

fbshipit-source-id: 185d5987a08bfdaf63f0f245310c6da69878d415
2020-05-05 11:09:29 -07:00
Yanqin Jin
5584595f80 Do not swallow error returned from SaveTo() (#6801)
Summary:
With consistency check enabled, VersionBuilder::SaveTo() may return error once
corruption is detected while building versions. We should handle these errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6801

Test Plan: make check

Reviewed By: siying

Differential Revision: D21385045

Pulled By: riversand963

fbshipit-source-id: 98f6424e2a4699b62befa21e9fe00e70a771118e
2020-05-05 10:46:20 -07:00
Yanqin Jin
5a61e7864d Fix db_stress when GetLiveFiles() flushes dropped CF (#6805)
Summary:
Current impl. of db_stress will abort verification and report failure if
GetLiveFiles() causes a dropped column family to be flushed. This is not
desired.
To fix, this PR makes the following change:
In GetLiveFiles, if flush is triggered and returns
Status::IsColumnFamilyDropped(), then set status to Status::OK().
This is OK because dropped column families will be skipped during the rest of
this function, and valid column families will have their live files returned to
caller.

Test plan (dev server):
make check
./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805

Reviewed By: ltamasi

Differential Revision: D21390044

Pulled By: riversand963

fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
2020-05-04 17:45:49 -07:00
Levi Tamasi
a00ddf1574 Expose the set of live blob files from Version/VersionSet (#6785)
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21336210

Pulled By: ltamasi

fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
2020-05-04 15:08:13 -07:00
sdong
680c416348 Avoid Swallowing Some File Consistency Checking Bugs (#6793)
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.

Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793

Test Plan: Add a unit test to cover the reopen case.

Reviewed By: riversand963

Differential Revision: D21366525

fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
2020-05-04 14:18:11 -07:00
sdong
6504ae0c4e Remove the support of setting CompressionOptions.parallel_threads from string for now (#6782)
Summary:
The current way of implementing CompressionOptions.parallel_threads introduces a format change. We plan to change CompressionOptions's serailization format to a new JSON-like format, which would be another format change. We would like to consolidate the two format changes into one, rather than making some users to change twice. Hold CompressionOptions.parallel_threads from being supported by option string for now. Will add it back after the general CompressionOptions's format change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6782

Test Plan: Run all existing tests.

Reviewed By: zhichao-cao

Differential Revision: D21338614

fbshipit-source-id: bca2dac3cb37d4e6e64b52cbbe8ea749cd848685
2020-04-30 17:01:17 -07:00
anand76
ab13d43e1d Pass a timeout to FileSystem for random reads (#6751)
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.

Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751

Reviewed By: riversand963

Differential Revision: D21285631

Pulled By: anand1976

fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
2020-04-30 14:50:39 -07:00
Levi Tamasi
fe238e5438 Keep track of obsolete blob files in VersionSet (#6755)
Summary:
The patch adds logic to keep track of obsolete blob files. A blob file becomes
obsolete when the last `shared_ptr` that points to the corresponding
`SharedBlobFileMetaData` object goes away, which, in turn, happens when the
last `Version` that contains the blob file is destroyed. No longer needed blob
files are added to the obsolete list in `VersionSet` using a custom deleter to
avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`.
Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored
in `JobContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21233155

Pulled By: ltamasi

fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a
2020-04-30 11:25:51 -07:00
Zhichao Cao
8c694025e9 Fix potential size_t overflow in import_column_family (#6762)
Summary:
The issue is reported in https://github.com/facebook/rocksdb/issues/6753 . size_t is unsigned and if sorted_file.size() is 0, the end condition of i will be extremely large, cause segment fault in sorted_files[i] and sorted_files[i+1]. Added condition to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6762

Test Plan: make asan_check

Reviewed By: pdillinger

Differential Revision: D21323063

Pulled By: zhichao-cao

fbshipit-source-id: 56ce59201949ed319448228553202b8642c2cc3a
2020-04-30 08:40:42 -07:00
Derrick Pallas
5272305437 Fix FilterBench when RTTI=0 (#6732)
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti.  Replace with static_cast_with_check.

Signed-off-by: Derrick Pallas <derrick@pallas.us>

Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732

Reviewed By: ltamasi

Differential Revision: D21304260

Pulled By: pdillinger

fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
2020-04-29 13:09:23 -07:00
Peter Dillinger
8086e5e294 Fix LITE build (#6770)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6770

Test Plan: make LITE=1 check

Reviewed By: ajkr

Differential Revision: D21296261

Pulled By: pdillinger

fbshipit-source-id: b6075cc13a6d6db48617b7e0e9ebeea9364dfd9f
2020-04-28 21:37:20 -07:00
anand76
335ea73e49 Fix a valgrind failure due to DBBasicTestMultiGetDeadline (#6756)
Summary:
Fix a valgrind failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6756

Test Plan: valgrind_test

Reviewed By: pdillinger

Differential Revision: D21284660

Pulled By: anand1976

fbshipit-source-id: 39bf1bd130b6adb585ddbf2f9aa2f53dbf666f80
2020-04-28 20:06:06 -07:00
mrambacher
618bf638aa Add Functions to OptionTypeInfo (#6422)
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo.  These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map.  Using these functions, every type can be handled via the map rather than special cased.

By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422

Test Plan: pass make check

Reviewed By: siying

Differential Revision: D21269005

Pulled By: zhichao-cao

fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
2020-04-28 18:04:26 -07:00
Peter Dillinger
b810e62b39 Clarifying comments in db.h (#6768)
Summary:
And fix a confusingly worded log message
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6768

Reviewed By: anand1976

Differential Revision: D21284527

Pulled By: pdillinger

fbshipit-source-id: f03c1422c229a901c3a65e524740452349626164
2020-04-28 15:26:03 -07:00
Peter Dillinger
bae6f58696 Basic MultiGet support for partitioned filters (#6757)
Summary:
In MultiGet, access each applicable filter partition only once
per batch, rather than for each applicable key. Also,

* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including
* Add efficient BitsSetToOne implementation
* Assert that MultiGetContext::Range does not go beyond shift range

Performance test: Generate db:

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
    ...

Before (middle performing run of three; note some missing Bloom stats):

    $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
    rocksdb.block.cache.filter.hit COUNT : 83443275
    rocksdb.bloom.filter.useful COUNT : 0
    rocksdb.bloom.filter.full.positive COUNT : 0
    rocksdb.bloom.filter.full.true.positive COUNT : 7931450
    rocksdb.number.multiget.get COUNT : 385984
    rocksdb.number.multiget.keys.read COUNT : 12351488
    rocksdb.number.multiget.bytes.read COUNT : 793145000
    rocksdb.number.multiget.keys.found COUNT : 7931450

After (middle performing run of three):

    $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
    rocksdb.block.cache.filter.hit COUNT : 49856682
    rocksdb.bloom.filter.useful COUNT : 45684579
    rocksdb.bloom.filter.full.positive COUNT : 10395458
    rocksdb.bloom.filter.full.true.positive COUNT : 9908456
    rocksdb.number.multiget.get COUNT : 481984
    rocksdb.number.multiget.keys.read COUNT : 15423488
    rocksdb.number.multiget.bytes.read COUNT : 990845600
    rocksdb.number.multiget.keys.found COUNT : 9908456

So that's about 25% higher throughput even for random keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757

Test Plan: unit test included

Reviewed By: anand1976

Differential Revision: D21243256

Pulled By: pdillinger

fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
2020-04-28 14:49:34 -07:00
Yanqin Jin
d4398e08fc Fix timestamp support for MultiGet (#6748)
Summary:
1. Avoid nullptr dereference when passing timestamp to KeyContext creation.
2. Construct LookupKey correctly with timestamp when creating MultiGetContext.
3. Compare without timestamp when sorting KeyContexts.

Fixes https://github.com/facebook/rocksdb/issues/6745

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748

Reviewed By: pdillinger

Differential Revision: D21258691

Pulled By: riversand963

fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004
2020-04-27 22:49:56 -07:00
Peter Dillinger
249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
Cheng Chang
0a77617820 Disable O_DIRECT in stress test when db directory does not support direct IO (#6727)
Summary:
In crash test, the db directory might be set to /dev/shm or /tmp, in certain environments such as internal testing infrastructure, neither of these directories support direct IO, so direct IO is never enabled in crash test.

This PR sets up SyncPoints in direct IO related code paths to disable O_DIRECT flag in calls to `open`, so the direct IO code paths will be executed, all direct IO related assertions will be checked, but no real direct IO request will be issued to the file system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6727

Test Plan:
export CRASH_TEST_EXT_ARGS="--use_direct_reads=1 --mmap_read=0"
make -j24 crash_test

Reviewed By: zhichao-cao

Differential Revision: D21139250

Pulled By: cheng-chang

fbshipit-source-id: db9adfe78d91aa4759835b1af91c5db7b27b62ee
2020-04-25 00:01:03 -07:00
Yanqin Jin
e04f3bce4f Update CURRENT file after best-efforts recovery (#6746)
Summary:
After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D21189876

Pulled By: riversand963

fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b
2020-04-23 16:21:09 -07:00
Levi Tamasi
bc51e33d9c Make sure (Shared)BlobFileMetaData are owned by shared_ptrs (#6749)
Summary:
The patch makes a couple of small cleanups to `SharedBlobFileMetaData` and `BlobFileMetaData`:
* It makes the constructors private and introduces factory methods to ensure these objects are always owned by `shared_ptr`s. Note that `SharedBlobFileMetaData` has an additional factory that takes a deleter object; we can utilize this to e.g. notify `VersionSet` when a blob file becomes obsolete (which is exactly when `SharedBlobFileMetaData` is destroyed).
* It disables move operations explicitly instead of relying on them being suppressed because of a user-declared destructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6749

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21206947

Pulled By: ltamasi

fbshipit-source-id: 9094c14cc335b3e226f883e5a0df4f87a5cdeb95
2020-04-23 13:44:29 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
anand76
c1ccd6b6af Implement deadline support for MultiGet (#6710)
Summary:
Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This
implementation enforces the deadline in DBImpl, which is fairly high
level. Its best effort and may not check the deadline after every key
lookup, but may do so after a batch of keys.

In subsequent stages, we will extend this to passing a timeout down to the FileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710

Test Plan: Add new unit tests

Reviewed By: riversand963

Differential Revision: D21149158

Pulled By: anand1976

fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e
2020-04-21 14:51:51 -07:00
Tomas Kolda
6ee66cf8f0 Prevents Table Cache to open same files more times (#6707)
Summary:
In highly concurrent requests table cache opens same file more times which lowers purpose of max_open_files. Fixes (https://github.com/facebook/rocksdb/issues/6699)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6707

Reviewed By: ltamasi

Differential Revision: D21044965

fbshipit-source-id: f6e91d90b60dad86e518b5147021da42460ee1d2
2020-04-21 13:16:31 -07:00
Akanksha Mahajan
03a1d95db0 Set max_background_flushes dynamically (#6701)
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
                   2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
                        max_background_flushes dynamically using SetDBOptions.

TestPlan:  1. make -j64 check
                  2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701

Reviewed By: ajkr

Differential Revision: D21028010

Pulled By: akankshamahajan15

fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
2020-04-20 16:19:02 -07:00
Peter Dillinger
31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
Peter Dillinger
45d2b4efca Fix tabs and lint-ignores (#6734)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6734

Reviewed By: cheng-chang

Differential Revision: D21134556

Pulled By: pdillinger

fbshipit-source-id: 3636cc1d1333137b70031f8277458781c21631fb
2020-04-20 11:39:31 -07:00
Andrew Kryczka
6717ada899 Fix CF import with overlapping SST files (#6663)
Summary:
Invariant checking should use internal key comparator rather than
`sstableKeyCompare()`. The latter was intended for checking whether a
compaction input file's neighboring files need to be included in the
same compaction. Using it for invariant checking was leading to false
positives for files with overlapping endpoints.

Fixes https://github.com/facebook/rocksdb/issues/6647.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6663

Test Plan: regression test

Reviewed By: zhichao-cao

Differential Revision: D20910466

Pulled By: ajkr

fbshipit-source-id: f0b70dad7c4096fce635cab7a36f16e14f74ae3f
2020-04-16 13:16:06 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Yanqin Jin
eeb3cf3f58 Fix release build (#6690)
Summary:
Fix release build caused by variable defined but unused.

Test plan (devserver)
```
make release
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6690

Reviewed By: cheng-chang

Differential Revision: D20980571

Pulled By: riversand963

fbshipit-source-id: c3f3b13f81dce4bdb19876dc2e710d5902ff8a02
2020-04-11 22:04:04 -07:00
Yanqin Jin
0c05624d50 Compaction with timestamp: input boundaries (#6645)
Summary:
Towards making compaction logic compatible with user timestamp.
When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp.

Test plan (devserver):
```
make check
```
Several individual tests:
```
./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
./db_with_timestamp_compaction_test
./db_with_timestamp_basic_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645

Reviewed By: ltamasi

Differential Revision: D20960012

Pulled By: riversand963

fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
2020-04-10 16:05:49 -07:00
Akanksha Mahajan
a0faff126d Report kFilesMarkedForCompaction for delete triggered compactions (#6680)
Summary:
Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported.
          Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction.

Test Plan : make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D20945946

Pulled By: akankshamahajan15

fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25
2020-04-10 15:30:38 -07:00
Huisheng Liu
9e89ffb776 make iterator return versions between timestamp bounds (#6544)
Summary:
(Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.

SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit e860f8840):
seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
This PR:
seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)

db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544

Reviewed By: ltamasi

Differential Revision: D20844069

Pulled By: riversand963

fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
2020-04-10 09:51:58 -07:00
Yi Wu
eb287c72d7 Fix wrong key being read on ingested file with global seqno and delta encoding (#6669)
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.

The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669

Test Plan:
New unit test

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Reviewed By: cheng-chang

Differential Revision: D20931808

Pulled By: ajkr

fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
2020-04-08 21:22:15 -07:00
Peter Dillinger
e5f1bfc263 Fix initializer syntax for old Xcode compiler (#6662)
Summary:
Example compiler output, from OSX TEST_GROUP=3:

db/flush_job_test.cc:185:7: error: suggest braces around initialization
of subobject [-Werror,-Wmissing-braces]
      kInvalidBlobFileNumber, 5, 103, 17, 102, 101};

Apparently permitted in newer version, but worth working around.
https://stackoverflow.com/questions/31555584/why-is-clang-warning-suggest-braces-around-initialization-of-subobject-wmis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6662

Test Plan: CI (temporarily including OSX TEST_GROUP=3 in Travis)

Reviewed By: ltamasi

Differential Revision: D20901009

Pulled By: pdillinger

fbshipit-source-id: 5338878613b5725e5d632c8858904de467dc4692
2020-04-07 16:00:26 -07:00
Kirill Abrosimov
3ff603171d added new functions to c-api (#5630)
Summary:
Few functions from options added to C-api
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5630

Reviewed By: anand1976

Differential Revision: D20896731

Pulled By: ltamasi

fbshipit-source-id: e4215a58b3c2429ec44e3f0d0381cbf86700fb14
2020-04-07 14:45:39 -07:00
Steven Fackler
f53cdab3d7 Hex encode keys in compaction flush logs (#6616)
Summary:
The raw key bytes are currently dumped directly into the log messages,
which is not ideal if the keys aren't ASCII strings. Null bytes in
particular can cut off bits of the message early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616

Reviewed By: ajkr

Differential Revision: D20879218

Pulled By: anand1976

fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7
2020-04-06 17:41:45 -07:00
mrambacher
259b6ec8da Move the OptionTypeMap code closer to home (#6198)
Summary:
This is a predecessor to the Configurable PR.  This change moves the OptionTypeInfo maps closer to where they will be used.

When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198

Reviewed By: siying

Differential Revision: D20778108

Pulled By: zhichao-cao

fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
2020-04-03 10:52:38 -07:00
Peter Dillinger
079e77ff9e Revamp cache_bench to resemble a real workload (#6629)
Summary:
I suspect LRUCache could use some optimization, and to support
such an effort, a good benchmarking tool is needed. The existing
cache_bench was heavily skewed toward insertion and lookup misses, and
did not saturate memory with other work. This change should improve
those things to better resemble a real workload.

(All below using clang compiler, for some consistency, but not
necessarily same version and settings.)

The real workload is from production MySQL on RocksDB, filtering stacks
containing "LRU", "ShardedCache" or "CacheShard."
Lookup inclusive: 66%
Insert inclusive: 17%
Release inclusive: 15%

An alternate simulated workload is MySQL running a LinkBench read test:
Lookup inclusive: 54%
Insert inclusive: 24%
Release inclusive: 21%

cache_bench default settings, prior to this change:
Lookup inclusive: 35.8%
Insert inclusive: 63.6%
Release inclusive: 0%

cache_bench after this change (intended as somewhat "tighter" workload
than average production, more like LinkBench):
Lookup inclusive: 52%
Insert inclusive: 20%
Release inclusive: 26%

And top exclusive stacks (portion of stack samples as filtered above):
Production MySQL:
LRUHandleTable::FindPointer: 25.3%
rocksdb::operator==: 15.1%  <-- Slice ==
LRUCacheShard::LRU_Remove: 13.8%
ShardedCache::Lookup: 8.9%
__pthread_mutex_lock: 7.1%
LRUCacheShard::LRU_Insert: 6.3%
MurmurHash64A: 4.8%  <-- Since upgraded to XXH3p
...

Old cache_bench:
LRUHandleTable::FindPointer: 23.6%
__pthread_mutex_lock: 15.0%
__pthread_mutex_unlock_usercnt: 11.7%
__lll_lock_wait: 8.6%
__lll_unlock_wake: 6.8%
LRUCacheShard::LRU_Insert: 6.0%
ShardedCache::Lookup: 4.4%
LRUCacheShard::LRU_Remove: 2.8%
...
rocksdb::operator==: 0.2%  <-- Slice ==
...

New cache_bench:
LRUHandleTable::FindPointer: 22.8%
__pthread_mutex_unlock_usercnt: 14.3%
rocksdb::operator==: 10.5%  <-- Slice ==
LRUCacheShard::LRU_Insert: 9.0%
__pthread_mutex_lock: 5.9%
LRUCacheShard::LRU_Remove: 5.0%
...
ShardedCache::Lookup: 2.9%
...

So there's a bit more lock contention in the benchmark than in
production, but otherwise looks similar enough to me. At least it's a
big improvement over the existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6629

Test Plan: No production code changes, ran cache_bench with ASAN

Reviewed By: ltamasi

Differential Revision: D20824318

Pulled By: pdillinger

fbshipit-source-id: 6f8dc5891ead0f87edbed3a615ecd5289d9abe12
2020-04-03 10:26:49 -07:00
Zhichao Cao
ef088f0e93 Fix the multi-thread Manifest write dependency in error_handler_fs_test (#6637)
Summary:
In CompactionManifestWriteRetryableError in error_handler_fs_test, the manifest write of flush should pass with no fs error. After flush, fs is set to error status and the manifest write of compaction should fail due to the IO Error. Currently, the manifest write of flush is not synced with the compaction in order, which might cause manifest write fails, which will cause test failure. Fixed by adding the LoadDependency of sync-point after flush and before compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6637

Test Plan: pass error_hanlder_fs_tes. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20826969

Pulled By: zhichao-cao

fbshipit-source-id: fb2e702caa19bd63c82570320536b7acda870ff1
2020-04-02 18:08:46 -07:00
anand76
0709cd04ca Fix LITE mode test failure in DBOptionsTest.ChangeCompression (#6635)
Summary:
This failure was introduced in https://github.com/facebook/rocksdb/issues/6262
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6635

Reviewed By: siying

Differential Revision: D20822602

Pulled By: anand1976

fbshipit-source-id: 96b316816cce6b95b092a7fc46ea968ed6ba8809
2020-04-02 16:41:09 -07:00
sdong
d0f3894cf1 In block based table builder, make variables for estimating file size atomic (#6636)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636

Test Plan: Run all tests with TSAN.

Reviewed By: anand1976

Differential Revision: D20820635

fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
2020-04-02 16:16:24 -07:00
Levi Tamasi
2165c3bacc Re-persist blob file metadata when a new manifest file is created (#6630)
Summary:
Does what it says on the can. Similarly to table files, we need to re-persist
the metadata of live blob files whenever a new manifest file is opened.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6630

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20802126

Pulled By: ltamasi

fbshipit-source-id: 5738692d898790293bf09d66e9997369bbf89566
2020-04-02 11:53:05 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Yanqin Jin
b5818f87f0 Fix clang analyze error (#6622)
Summary:
As title. https://github.com/facebook/rocksdb/issues/6612 caused clang analyze to fail with the error:
```
db/compaction/compaction_picker_fifo.cc:105:39: warning: Called C++ object pointer is null
                     cf_name.c_str(), f->fd.GetNumber(), creation_time);
                                      ^~~~~~~~~~~~~~~~~
./logging/logging.h:59:36: note: expanded from macro 'ROCKS_LOG_BUFFER'
                                 ##__VA_ARGS__)
                                   ^~~~~~~~~~~
1 warning generated.
```

Test Plan (devserver):
USE_CLANG=1 make analyze
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6622

Reviewed By: ltamasi

Differential Revision: D20787407

Pulled By: riversand963

fbshipit-source-id: a5de4910cc1aa0d3481a73ec114578925bfd63f7
2020-04-01 10:01:38 -07:00
Levi Tamasi
e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00
sdong
80979f81c7 Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615)
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615

Test Plan: Add a unit test to make sure that SetOptions() can change the options.

Reviewed By: riversand963

Differential Revision: D20755951

fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
2020-03-31 12:11:42 -07:00
Yanqin Jin
18cf0de640 Use flush time for the props.creation_time for FIFO compaction (#6612)
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612

Test Plan: make check

Reviewed By: siying

Differential Revision: D20748217

Pulled By: riversand963

fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
2020-03-30 18:59:17 -07:00
Zhichao Cao
e8d332d97e Use FileChecksumGenFactory for SST file checksum (#6600)
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
2020-03-29 15:58:46 -07:00
Cheng Chang
ee50b8d499 Be able to decrease background thread's CPU priority when creating database backup (#6602)
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.

This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602

Test Plan: make check

Reviewed By: siying, zhichao-cao

Differential Revision: D20683216

Pulled By: cheng-chang

fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
2020-03-28 19:07:25 -07:00