Commit Graph

9867 Commits

Author SHA1 Message Date
Zhichao Cao
fd8247b1be update HISTORY.md and bump version for 6.19.4 2021-04-23 15:42:04 -07:00
Zhichao Cao
fe352574b4 Fix the false positive alert of CF consistency check in WAL recovery (#8207)
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.

Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.

Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207

Test Plan: added unit test, make crash_test

Reviewed By: riversand963

Differential Revision: D27898839

Pulled By: zhichao-cao

fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
2021-04-23 15:39:35 -07:00
Andrew Kryczka
6fbe42a6fd Fix seqno in ingested file boundary key metadata (#8209)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6245.

Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205.

Previously we were writing the ingested file's smallest/largest internal keys
with sequence number zero, or `kMaxSequenceNumber` in case of range
tombstone. The former (sequence number zero) is incorrect and can lead
to files being incorrectly ordered. The fix in this PR is to overwrite
boundary keys that have sequence number zero with the ingested file's assigned
sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209

Test Plan: repro unit test

Reviewed By: riversand963

Differential Revision: D27885678

Pulled By: ajkr

fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a
2021-04-23 15:37:41 -07:00
Andrew Gallagher
e6c47cd6cc Cleanup include (#8208)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208

Make include of "file_system.h" use the same include path as everywhere
else.

Reviewed By: riversand963, akankshamahajan15

Differential Revision: D27881606

fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
2021-04-20 15:24:25 -07:00
Yanqin Jin
645c445978 Bump version 2021-04-19 19:57:42 -07:00
Yanqin Jin
f952de5be2 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 19:56:59 -07:00
Adam Retter
4362985805 Fix Windows strcmp for Unicode (#8190)
Summary:
The code for strcmp that was present does work when compiled for Windows unicode file paths.

Needs backporting to:
* 6.17.fb
* 6.18.fb
* 6.19.fb

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8190

Reviewed By: akankshamahajan15

Differential Revision: D27765588

Pulled By: jay-zhuang

fbshipit-source-id: 89f8a5ac61fd7edc758340dfd335b0a5f96dae6e
2021-04-19 08:41:06 -07:00
Yanqin Jin
85cdad116b Update history and bump version 2021-04-08 11:26:13 -07:00
Yanqin Jin
07e3794972 Fix a bug for SeekForPrev with partitioned filter and prefix (#8137)
Summary:
According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last
key in the previous partition" so that SeekForPrev() in prefix mode can return correct result.
The prefix of the last key in the previous partition does not necessarily have the same prefix
as the first key in the current partition. Regardless of the first key in current partition, the
prefix of the last key in the previous partition should be added. The existing code, however,
does not follow this. Furthermore, there is another issue: when finishing current filter partition,
`FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively
overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds
to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding
the bloom of the prefix of the last key of the previous partition.

Prefix extractor is FixedLength.2.
```
[  filter part 1   ]    [  filter part 2    ]
                  abc    d
```
When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc"
but smaller than "d".
If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build
incorrect iterator tree in non-total-order mode.

Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion
error when checking on an arbitrary key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137

Test Plan:
```
make check
```

Without this fix, the following command will fail pretty soon.
```
./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \
--avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \
--batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \
--bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \
--checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
--compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \
--compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \
--compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \
--continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \
--db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \
--enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \
--format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
--get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \
--iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
--log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \
--max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \
--nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \
--periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \
--readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \
--snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \
--sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \
--target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \
--top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \
--use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \
--use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \
--verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \
--write_dbid_to_manifest=1 --writepercent=35
```

Reviewed By: pdillinger

Differential Revision: D27553054

Pulled By: riversand963

fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98
2021-04-08 11:21:00 -07:00
Adam Retter
1624f20934 Update ZStd. Fixes an issue with Make 3.82 (#8155)
Summary:
The previous version of ZStd doesn't build correctly with Make 3.82. Updating it resolves the issue.

jay-zhuang This also needs to be cherry-picked to:
1. 6.17.fb
2. 6.18.fb
3. 6.19.fb

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8155

Reviewed By: riversand963

Differential Revision: D27596460

Pulled By: ajkr

fbshipit-source-id: ac8492245e6273f54efcc1587346a797a91c9441
2021-04-08 09:04:45 -07:00
Andrew Gallagher
1e96a70be4 Use include_paths instead of raw -I in TARGETS (#8143)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8143

The latter assume the location of the compile root, which can break
if the build root changes.  Switch to the slightly more intelligent
`include_paths`, which should provide the same functionality, but do
with independent of include root.

Reviewed By: riversand963

Differential Revision: D27535869

fbshipit-source-id: 0129e47c0ce23e08528c9139114a591c14866fa8
2021-04-05 14:38:18 -07:00
Andrew Kryczka
efe9d5c823 update HISTORY.md and bump version for 6.19.1 2021-04-01 10:01:18 -07:00
Andrew Kryczka
a265ac75ab Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141)
Summary:
Return early in case there are zero data blocks when
`BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can
only be triggered by applying dictionary compression to SST files that
contain only range tombstones. It cannot be triggered by a low buffer
limit alone since we only consider entering unbuffered mode after
buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot
be triggered by a totally empty file because those go through
`Abandon()` rather than `Finish()` so unbuffered mode is never entered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141

Test Plan: added a unit test that repro'd the "Floating point exception"

Reviewed By: riversand963

Differential Revision: D27495640

Pulled By: ajkr

fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23
2021-04-01 10:00:27 -07:00
Adam Retter
9435b2e959 range_tree requires GNU libc on ppc64 (#8070)
Summary:
If the platform is ppc64 and the libc is not GNU libc, then we exclude the range_tree from compilation.

See https://jira.percona.com/browse/PS-7559

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8070

Reviewed By: jay-zhuang

Differential Revision: D27246004

Pulled By: mrambacher

fbshipit-source-id: 59d8433242ce7ce608988341becb4f83312445f5
2021-03-30 16:04:52 -07:00
Zhichao Cao
0f8c041ea7 Remove disabled tests (#8123)
Summary:
Remove disabled tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8123

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D27367066

Pulled By: zhichao-cao

fbshipit-source-id: 71fa1d492d9b0144decff0a1d0e0ef25c0ecc4ba
2021-03-28 23:06:08 -07:00
Jay Zhuang
9da6019f9d Avoid checking errno on success call (#8119)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8119

Reviewed By: sushilpa

Differential Revision: D27365407

Pulled By: jay-zhuang

fbshipit-source-id: 327c09bf76834ce0be4287680640adc8b88bcec2
2021-03-28 23:05:56 -07:00
Zhichao Cao
d5e2462946 Fix flush no wal IO error bug (#8107)
Summary:
There is bug in the current code base introduced in https://github.com/facebook/rocksdb/issues/8049 , we still set the SST file write IO Error only case as hard error. Fix it by removing the logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8107

Test Plan: make check, error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D27321422

Pulled By: zhichao-cao

fbshipit-source-id: c014afc1553ca66b655e3bbf9d0bf6eb417ccf94
2021-03-25 21:52:27 -07:00
Zhichao Cao
5b3ebdc3d1 Added append with checksum handoff API to hdfs (#8084)
Summary:
Added append with checksum handoff API to hdfs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8084

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D27237823

Pulled By: zhichao-cao

fbshipit-source-id: 93b38db23b1811a6daa049afb89240089ec6f67c
2021-03-23 17:25:42 -07:00
Zhichao Cao
7457c7cd00 Update release version to 6.19 (#8083)
Summary:
Update release version to 6.19

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8083

Test Plan: no code change

Reviewed By: riversand963

Differential Revision: D27222083

Pulled By: zhichao-cao

fbshipit-source-id: 94b49997019347e6e6a9e341837f4f9d3149428c
2021-03-21 18:33:46 -07:00
Peter Dillinger
3bfd3ed2f3 Begin forward compatibility for new backup meta schema (#8069)
Summary:
This does not add any new public APIs or published
functionality, but adds the ability to read and use (and in tests,
write) backups with a new meta file schema, based on the old schema
but not forward-compatible (before this change). The new schema enables
some capabilities not in the old:

* Explicit versioning, so that users get clean error messages the next
time we want to break forward compatibility.
* Ignoring unrecognized fields (with warning), so that new non-critical
features can be added without breaking forward compatibility.
* Rejecting future "non-ignorable" fields, so that new features critical
to some use-cases could potentially be added outside of linear schema
versions, with broken forward compatibility.
* Fields at the end of the meta file, such as for checksum of the meta
file's contents (up to that point)
* New optional 'size' field for each file, which is checked when present
* Optionally omitting 'crc32' field, so that we aren't required to have
a crc32c checksum for files to take a backup. (E.g. to support backup
via hard links and to better support file custom checksums.)

Because we do not have a JSON parser and to share code, the new schema
is simply derived from the old schema.

BackupEngine code is updated to allow missing checksums in some places,
and to make that easier, `has_checksum` and `verify_checksum_after_work`
are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm
not too afraid of regressing on data integrity, because
(a) we have pretty good test coverage of corruption detection in backups, and
(b) we are increasingly relying on the DB itself for data integrity rather than
it being an exclusive feature of backups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069

Test Plan:
new unit tests, added to crash test (some local run with
boosted backup probability)

Reviewed By: ajkr

Differential Revision: D27139824

Pulled By: pdillinger

fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5
2021-03-19 20:15:40 -07:00
storagezhang
c8b0842bcd Remove unused variable (#8067)
Summary:
Remove unused variable `Slice blob_to_write` in `db/blob/blob_file_cache_test.cc`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8067

Reviewed By: zhichao-cao

Differential Revision: D27107693

Pulled By: riversand963

fbshipit-source-id: 9bfd4d296a6a1714ad5c1fa5bb231a0c52dbd56d
2021-03-19 12:13:59 -07:00
storagezhang
d9be6556aa Include C++ standard library headers instead of C compatibility headers (#8068)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8068

Reviewed By: zhichao-cao

Differential Revision: D27147685

Pulled By: riversand963

fbshipit-source-id: 5428b1c0142ecae17c977fba31a6d49b52983d1c
2021-03-19 12:09:47 -07:00
storagezhang
c706324208 Add default in switch (#8065)
Summary:
switch may not cover all branch in `db/c.cc`:

```c++
void rocksdb_options_set_access_hint_on_compaction_start(
    rocksdb_options_t* opt, int v) {
  switch(v) {
    case 0:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::NONE;
      break;
    case 1:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::NORMAL;
      break;
    case 2:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::SEQUENTIAL;
      break;
    case 3:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::WILLNEED;
      break;
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8065

Reviewed By: riversand963

Differential Revision: D27102892

Pulled By: zhichao-cao

fbshipit-source-id: ad1d20d192712878e61597311ba75b55df0066d7
2021-03-19 11:57:52 -07:00
Zhichao Cao
dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Yanqin Jin
7ee41a5d25 Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075)
Summary:
As title.
Test plan
ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test
./backupable_db_test
./error_handler_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075

Reviewed By: zhichao-cao

Differential Revision: D27173832

Pulled By: riversand963

fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30
2021-03-18 21:52:48 -07:00
Yanqin Jin
576cff11da Remove db_with_timestamp_basic_test from platform_dependent list (#8077)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8077

Test Plan: Travis CI

Reviewed By: zhichao-cao

Differential Revision: D27178276

Pulled By: riversand963

fbshipit-source-id: 17911dcc2d5790eb396efcd7f90dea76a127cf15
2021-03-18 20:09:04 -07:00
Yanqin Jin
063a68b9cd Check and handle failure in ldb (#8072)
Summary:
Currently, a few ldb commands do not check the execution result of
database operations. This PR checks the execution results and tries to
improve the error reporting.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8072

Test Plan:
```
make check
```
and
```
ASSERT_STATUS_CHECKED=1 make -j20 ldb
python tools/ldb_test.py
```

Reviewed By: zhichao-cao

Differential Revision: D27152466

Pulled By: riversand963

fbshipit-source-id: b94220496a4b3591b61c1d350f665860a6579f30
2021-03-18 14:43:34 -07:00
Zhichao Cao
c810947184 Separate handling of WAL Sync io error with SST flush io error (#8049)
Summary:
In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL.

Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049

Test Plan: added new unit test, make check

Reviewed By: anand1976

Differential Revision: D26965529

Pulled By: zhichao-cao

fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
2021-03-18 14:33:16 -07:00
Peter Dillinger
e7a60d01b2 Revamp WriteController (#8064)
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
  * Wiping all pending delay "debts" when another column family joins
  the delay with GetDelayToken().
  * Resetting last_refill_time_ to (now + sleep amount) means each
  column family can write with delayed_write_rate for large writes.
  * Updating bytes_left_ for a partial refill without updating
  last_refill_time_ would essentially give out random bonuses,
  especially to medium-sized writes.

Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064

Test Plan: new tests, better than old tests

Reviewed By: mrambacher

Differential Revision: D27064936

Pulled By: pdillinger

fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
2021-03-18 09:47:31 -07:00
Zhichao Cao
08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Akanksha Mahajan
27d57a035e Use SST file manager to track blob files as well (#8037)
Summary:
Extend support to track blob files in SST File manager.
 This PR notifies SstFileManager whenever a new blob file is created,
 via OnAddFile and  an obsolete blob file deleted via OnDeleteFile
 and delete file via ScheduleFileDeletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037

Test Plan: Add new unit tests

Reviewed By: ltamasi

Differential Revision: D26891237

Pulled By: akankshamahajan15

fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
2021-03-17 20:44:49 -07:00
Xiaopeng Zhang
c603f2f898 support getUsage and getPinnedUsage in JavaAPI for Cache (#7925)
Summary:
support getUsage and getPinnedUsage in JavaAPI for Cache
also fix a typo in LRUCacheTest.java that the highPriPoolRatio is not valid(set 5, I guess it means 0.05)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7925

Reviewed By: mrambacher

Differential Revision: D26900241

Pulled By: ajkr

fbshipit-source-id: 735d1e40a16fa8919c89c7c7154ba7f81208ec33
2021-03-17 09:30:33 -07:00
Mark Callaghan
326670d265 Add new db_bench --benchmarks options for controlling compaction (#8027)
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish

These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027

Reviewed By: riversand963

Differential Revision: D27053861

Pulled By: ajkr

fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
2021-03-17 09:12:27 -07:00
stefan-zobel
8d9088464b Java-API: Fix minor Javadoc copy-paste errors (#8034)
Summary:
Fixes 3 minor Javadoc copy-paste errors in the `RocksDB#newIterator()` and `Transaction#getIterator()` variants that take a column family handle but are talking about iterating over "the database" or "the default column family".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8034

Reviewed By: jay-zhuang

Differential Revision: D26877667

Pulled By: mrambacher

fbshipit-source-id: 95dd95b667c496e389f221acc9a91b340e4b63bf
2021-03-16 18:07:09 -07:00
mrambacher
1a343bc393 Make ChRootEnv, EncryptedEnv, and TimedEnv into FileSystems (#7968)
Summary:
These classes were wraps of Env that provided only extensions to the FileSystem functionality.  Changed the classes to be FileSystems and the wraps to be of the CompositeEnvWrapper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968

Reviewed By: anand1976

Differential Revision: D26900253

Pulled By: mrambacher

fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77
2021-03-15 19:50:11 -07:00
Yanqin Jin
0304352882 Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062)
Summary:
When timestamp is enabled, key comparison should take this into account.
In `BlockBasedTableReader::Get()`, `BlockBasedTableReader::MultiGet()`,
assume the target key is `key`, and the timestamp upper bound is `ts`.
The highest key in current block is (key, ts1), while the lowest key in next
block is (key, ts2).
If
```
ts1 > ts > ts2
```
then
```
(key, ts1) < (key, ts) < (key, ts2)
```
It can be shown that if `Compare()` is used, then we will mistakenly skip the next
block. Instead, we should use `CompareWithoutTimestamp()`.

The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc`
parameterized so that different index types can be tested. A new unit test is
also added for more coverage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D27057557

Pulled By: riversand963

fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1
2021-03-15 17:44:52 -07:00
Yanqin Jin
85d4f2c8b3 Move a test file to a better location (#8054)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054

Test Plan: make check

Reviewed By: mrambacher

Differential Revision: D27017955

Pulled By: riversand963

fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098
2021-03-15 15:03:27 -07:00
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Andrew Kryczka
b8f40f7f7b Deflake tests of compaction based on compensated file size (#8036)
Summary:
CompactionDeletionTriggerReopen was observed to be flaky recently:
https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451.

I went through it and the related tests and arrived at different
conclusions on what constraints we can expect on DB size. Some
constraints got looser and some got tighter. The particular constraint
that flaked got a lot looser so at least the flake linked above would have been prevented.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036

Reviewed By: riversand963

Differential Revision: D26862566

Pulled By: ajkr

fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88
2021-03-14 20:25:42 -07:00
Levi Tamasi
b708b166dc Fix a harmless data race affecting two test cases (#8055)
Summary:
`DBTest.GetLiveBlobFiles` and `ObsoleteFilesTest.BlobFiles` both modify the
current `Version` in their setup phase, implicitly assuming that no other
threads would touch the `Version` while this is happening. The periodic
stats dumper thread violates this assumption; the patch fixes this by
disabling it in the affected test cases. (Note: the data race is
harmless in the sense that it only affects test code.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8055

Test Plan:
```
COMPILE_WITH_TSAN=1 make db_test -j24
gtest-parallel --repeat=10000 ./db_test --gtest_filter="*GetLiveBlobFiles"
COMPILE_WITH_TSAN=1 make obsolete_files_test -j24
gtest-parallel --repeat=10000 ./obsolete_files_test --gtest_filter="*BlobFiles"
```

Reviewed By: riversand963

Differential Revision: D27022715

Pulled By: ltamasi

fbshipit-source-id: b6cc77ed63d8bc1cbe0603522ff1a572182fc9ab
2021-03-12 16:44:35 -08:00
Peter Dillinger
01c2ec3fcb Add ROCKSDB_GTEST_BYPASS (#8048)
Summary:
This is for cases that do not meet the Facebook criteria for
SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the
message because gtest doesn't ever seem to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048

Test Plan: manual inspection of ./ribbon_test output, CI

Reviewed By: mrambacher

Differential Revision: D26953688

Pulled By: pdillinger

fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5
2021-03-12 16:02:06 -08:00
Peter Dillinger
119dda2195 Instantiate tests DBIteratorTestForPinnedData (#8051)
Summary:
a trial gtest upgrade discovered some parameterized tests missing instantiation. By some miracle, they still pass.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8051

Test Plan: thisisthetest

Reviewed By: mrambacher

Differential Revision: D27003684

Pulled By: pdillinger

fbshipit-source-id: cde1cab1551fb282f67d462d46574bd30bd5e61f
2021-03-12 12:31:29 -08:00
Peter Dillinger
589ea6bec2 Add BackupEngine API for backup file details (#8042)
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042

Test Plan:
validation of the API added to many existing backup unit
tests

Reviewed By: mrambacher

Differential Revision: D26936577

Pulled By: pdillinger

fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
2021-03-12 11:03:54 -08:00
Yanqin Jin
82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
Yanqin Jin
64517d184a Make secondary instance use ManifestTailer (#7998)
Summary:
This PR

- adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary,
- updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class,
- removes unused code in version_set.cc,
- updates existing tests, e.g. removing deleted sync points from unit tests,
- adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998

Test Plan:
make check
Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc

Reviewed By: jay-zhuang

Differential Revision: D26926641

Pulled By: riversand963

fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980
2021-03-10 10:59:44 -08:00
David CARLIER
7a3444bf1f Mac M1 crc32 intrinsics ARM64 check support proposal (#7893)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7893

Reviewed By: ajkr

Differential Revision: D26050966

Pulled By: jay-zhuang

fbshipit-source-id: 9df2bb65d82defd7fad49d5369979b03e22d39c2
2021-03-10 09:05:56 -08:00
stefan-zobel
cc34da75b5 Java-API: byteCompressionType should be declared as primitive type byte (#7981)
Summary:
The variable `byteCompressionType` is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Byte'.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7981

Reviewed By: ajkr

Differential Revision: D26546600

Pulled By: jay-zhuang

fbshipit-source-id: 07b579cdfcfc2262a448ca3626e216416fd05892
2021-03-09 22:05:16 -08:00
qinzuoyan
6fad38ebe8 Fix compile error (#7908)
Summary:
OS: Ubuntu 14.04
Compiler: GCC 4.9.4
Compile error:
```
db/forward_iterator.cc:996:62: error: declaration of ‘key’ shadows a member of 'this' [-Werror=shadow]
   auto cmp = [&](const FileMetaData* f, const Slice& key) -> bool {
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7908

Reviewed By: jay-zhuang

Differential Revision: D26899986

Pulled By: ajkr

fbshipit-source-id: 66b0b97aefd0f13a085e063491f8207366a9f848
2021-03-09 20:53:33 -08:00
Hans Holmberg
670567db09 Add support for custom file systems to ldb and sst_dump (#8010)
Summary:
This PR adds support for custom file systems to ldb and sst_dump by adding command line options for specifying --fs_uri and --backup_fs uri (for ldb backup/restore commands). fs_uri is already supported in db_bench and db_stress, and there is already support in ldb and db stress for specifying customized envs.

The PR also fixes what looks like a bug in the ldb backup/restore commands. As it is right now, backups can only be made from and to the same environment/file system which does not seem to be the intended behavior. This PR makes it possible to do/restore backups between different envs/file systems.

Example:
`./ldb backup --fs_uri=zenfs://dev:nvme2n1 --backup_fs_uri=posix:// --backup_dir=/tmp/my_rocksdb_backup  --db=rocksdbtest/dbbench
`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8010

Reviewed By: jay-zhuang

Differential Revision: D26904654

Pulled By: ajkr

fbshipit-source-id: 9b695ed8b944fcc6b27c4daaa9f52e87ee2c1fb4
2021-03-09 20:49:15 -08:00
Ed rodriguez
7381dad1b1 make:Fix c header prototypes (#7994)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7994

Reviewed By: jay-zhuang

Differential Revision: D26904603

Pulled By: ajkr

fbshipit-source-id: 0af92a51de895b40c7faaa4f0870b3f63279fe21
2021-03-09 20:44:23 -08:00