Commit Graph

9384 Commits

Author SHA1 Message Date
Yanqin Jin
e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
sdong
9cc25190e1 Test CircleCI with CLANG-10 (#7025)
Summary:
It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025

Test Plan: See all tests pass.

Reviewed By: pdillinger

Differential Revision: D22215700

fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
2020-06-24 16:22:49 -07:00
sdong
50d6969816 Fix unity build broken by #7007 (#7024)
Summary:
https://github.com/facebook/rocksdb/pull/7007 broken the unity build. Fix it by moving the const inside the function
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7024

Test Plan: make unity and see it to build.

Reviewed By: zhichao-cao

Differential Revision: D22212028

fbshipit-source-id: 5daff7383b691808164d4745ab543238502d946b
2020-06-24 13:40:48 -07:00
Zhichao Cao
83a4dd1a67 Fix the memory leak in Env_basic_test (#7017)
Summary:
Fix the memory leak broken asan and other test introduced by https://github.com/facebook/rocksdb/issues/6830
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7017

Test Plan: pass asan_check

Reviewed By: siying

Differential Revision: D22190289

Pulled By: zhichao-cao

fbshipit-source-id: 03a095f698b4f9d72fd9374191b17c890d7c2b56
2020-06-24 11:05:24 -07:00
Andrew Kryczka
8efb5cfb6f add SstFileManager to crash test (#6993)
Summary:
SstFileManager is already supported in the stress test as of https://github.com/facebook/rocksdb/issues/6454. This
PR enables the SstFileManager in some of the crash test runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6993

Reviewed By: riversand963

Differential Revision: D22084406

Pulled By: ajkr

fbshipit-source-id: 78b8642682e7570ff6ec3a1c3ccd9940f4362289
2020-06-23 16:27:20 -07:00
Levi Tamasi
9f21d08660 Move kNoExpiration to blob_db.h (#7018)
Summary:
The constant `kNoExpiration` is currently defined in an
internal/implementation header (`blob_log_format.h`); the patch moves it
to the public header `blob_db.h` so it is accessible to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7018

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22191354

Pulled By: ltamasi

fbshipit-source-id: 98c8012a83b999a3f1a30e955ce6bb71ba29dc5c
2020-06-23 13:45:06 -07:00
Peter Dillinger
5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Matthew Von-Maszewski
1092f19d95 Make EncryptEnv inheritable (#6830)
Summary:
EncryptEnv class is both declared and defined within env_encryption.cc.  This makes it really tough to derive new classes from that base.

This branch moves declaration of the class to rocksdb/env_encryption.h.  The change facilitates making new encryption modules (such as an upcoming openssl AES CTR pull request) possible / easy.

The only coding change was to add the EncryptEnv object to env_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6830

Reviewed By: riversand963

Differential Revision: D21706593

Pulled By: ajkr

fbshipit-source-id: 64d2da95a1569ceeb9b1549c3bec5404cf4c89f0
2020-06-22 13:27:16 -07:00
Zhichao Cao
d739318ba7 Fix double define in IO_tracer (#7007)
Summary:
Fix the following error

"./trace_replay/io_tracer.h:20:20: error: redefinition of ‘const unsigned int rocksdb::{anonymous}::kCharSize’
 const unsigned int kCharSize = 1;
                    ^~~~~~~~~
In file included from unity.cc:177:
trace_replay/block_cache_tracer.cc:22:20: note: ‘const unsigned int rocksdb::{anonymous}::kCharSize’ previously defined here
 const unsigned int kCharSize = 1;"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7007

Reviewed By: akankshamahajan15

Differential Revision: D22142618

Pulled By: zhichao-cao

fbshipit-source-id: e6dcd51ccc21d1f58df52cdc7a1c88e54cf4f6e8
2020-06-22 10:20:13 -07:00
sdong
096beb787e Remove CircleCI clang build's verbose output (#7000)
Summary:
As CirclrCI build's clang build is stable, verbose flag is less useful. On the other hand, the long outputs might create other problems. A non-reproducible failure "make: write error: stdout" might be related to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7000

Test Plan: Watch the run

Reviewed By: pdillinger

Differential Revision: D22118870

fbshipit-source-id: a4157a4282adddcb0c55c0e9e53b2d9ce18bda66
2020-06-19 17:11:55 -07:00
sdong
dea4063b13 Remove an assertion in FlushAfterIntraL0CompactionCheckConsistencyFail (#7003)
Summary:
FlushAfterIntraL0CompactionCheckConsistencyFail is flakey. It sometimes fails with:

db/db_compaction_test.cc:5186: Failure
Expected equality of these values:
  10
  NumTableFilesAtLevel(0)
    Which is: 3

I don't see a clear reason why the assertion would always be true. The necessarily of the assertion is not clear either. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7003

Test Plan: See the test still builds.

Reviewed By: riversand963

Differential Revision: D22129753

fbshipit-source-id: 42f0bb05e32b369e8d726bfd3e35c29cf52fe008
2020-06-19 16:58:29 -07:00
Peter Dillinger
25a0d0ca30 Fix block checksum for >=4GB, refactor (#6978)
Summary:
Although RocksDB falls over in various other ways with KVs
around 4GB or more, this change fixes how XXH32 and XXH64 were being
called by the block checksum code to support >= 4GB in case that should
ever happen, or the code copied for other uses.

This change is not a schema compatibility issue because the checksum
verification code would checksum the first (block_size + 1) mod 2^32
bytes while the checksum construction code would checksum the first
block_size mod 2^32 plus the compression type byte, meaning the
XXH32/64 checksums for >=4GB block would not match about 255/256 times.

While touching this code, I refactored to consolidate redundant
implementations, improving diagnostics and performance tracking in some
cases. Also used less confusing language in those diagnostics.

Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978

Test Plan:
I was able to write a test for this using an SST file writer
and VerifyChecksum in a reader. The test fails before the fix, though
I'm leaving the test disabled because I don't think it's worth the
expense of running regularly.

Reviewed By: gg814

Differential Revision: D22143260

Pulled By: pdillinger

fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
2020-06-19 16:18:24 -07:00
Andrew Kryczka
d76eed4839 minor fixes for stress/crash contruns (#7006)
Summary:
Avoid using `cf_consistency` together with `enable_compaction_filter` as
the former heavily uses snapshots while the latter is incompatible with
snapshots.

Also fix a clang-analyze error for a write to a variable that is never
read.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7006

Reviewed By: zhichao-cao

Differential Revision: D22141679

Pulled By: ajkr

fbshipit-source-id: 1840ae238168818a9ab5973f90fd78c067399447
2020-06-19 16:05:17 -07:00
Peter Dillinger
88b4210701 Remove racially charged terms "whitelist" and "blacklist" (#7008)
Summary:
We don't need them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7008

Test Plan: "make check" and ensure "make crash_test" starts

Reviewed By: ajkr

Differential Revision: D22143838

Pulled By: pdillinger

fbshipit-source-id: 72c8e16603abc59f4954e304466bc4dc1f58f94e
2020-06-19 15:27:32 -07:00
Zhichao Cao
a607f3efaa Fix unused variable failure (#7004)
Summary:
pass make check, run db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7004

Reviewed By: ajkr

Differential Revision: D22132617

Pulled By: zhichao-cao

fbshipit-source-id: d65397967e213206ec5efcb767bbdda8a575662a
2020-06-18 22:06:51 -07:00
Kefu Chai
f4583f7480 add WITH_EXAMPLES options to cmake and cleanups. (#6580)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6580

Reviewed By: cheng-chang

Differential Revision: D21846336

Pulled By: ajkr

fbshipit-source-id: e5bb0152a0876061d4ff158e7144eb9cc5a88cad
2020-06-18 18:00:04 -07:00
Akanksha Mahajan
552fd765b3 Add IOTracer reader, writer classes for reading/writing IO operations in a binary file (#6958)
Summary:
1. As part of IOTracing project, Add a class IOTracer,
IOTraceReader and IOTracerWriter that writes the file operations
information in a binary file. IOTrace Record contains record information
and right now it contains access_timestamp, file_operation, file_name,
io_status, len, offset and later other options will be added when file
system APIs will be call IOTracer.

2. Add few unit test cases that verify that reading and writing to a IO
Trace file is working properly and before start trace and after ending
trace nothing is added to the binary file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6958

Test Plan:
1. make check -j64
                 2. New testcases for IOTracer.

Reviewed By: anand1976

Differential Revision: D21943375

Pulled By: akankshamahajan15

fbshipit-source-id: 3532204e2a3eab0104bf411ab142e3fdd4fbce54
2020-06-18 10:46:11 -07:00
sdong
d6b7b7712f Fix a bug that causes iterator to return wrong result in a rare data race (#6973)
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973

Test Plan: Will run stress tests for a while to make sure there is no general regression.

Reviewed By: ajkr

Differential Revision: D22029348

fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
2020-06-18 10:16:38 -07:00
Yanqin Jin
569b87e8c7 Fail recovery when MANIFEST record checksum mismatch (#6996)
Summary:
https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows.
Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further.
```
// Correct
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
while (reader.ReadRecord() && s.ok()) {
...
}
```
The bug is that, the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` retries internally read and finds the next valid record. The mismatched record will be ignored and no error is reported.
```
// Incorrect
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
s = ReadAndRecover(reader, ...);

// Inside ReadAndRecover
  Status s;  // Shadows the s in Recover.
  while (reader.ReadRecord() && s.ok()) {
   ...
  }
```
`LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996

Reviewed By: ajkr

Differential Revision: D22105746

Pulled By: riversand963

fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38
2020-06-18 10:09:12 -07:00
Andrew Kryczka
775dc623ad add CompactionFilter to stress/crash tests (#6988)
Summary:
Added a `CompactionFilter` that is aware of the stress test's expected state. It only drops key versions that are already covered according to the expected state. It is incompatible with snapshots (same as all `CompactionFilter`s), so disables all snapshot-related features when used in the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6988

Test Plan:
running a minified blackbox crash test

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --max_key=1000000 -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -value_size_mult=33 --interval=10 --duration=3600
```

Reviewed By: anand1976

Differential Revision: D22072888

Pulled By: ajkr

fbshipit-source-id: 727b9d7a90d5eab18be0ec6cd5a810712ac13320
2020-06-18 09:54:55 -07:00
Andrew Kryczka
312f23c92d build fixes for GNU/kFreeBSD (#6992)
Summary:
Upstream https://salsa.debian.org/mariadb-team/mariadb-10.4/-/blob/master/debian/patches/rocksdb-kfreebsd.patch
by jrtc27.

Fixes https://github.com/facebook/rocksdb/issues/5223.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6992

Reviewed By: zhichao-cao

Differential Revision: D22084150

Pulled By: ajkr

fbshipit-source-id: 1822311ba16f112a15065b2180ce89d36af9cafc
2020-06-18 09:51:28 -07:00
sdong
223b57eeb8 Fix the bug that compressed cache is disabled in read-only DBs (#6990)
Summary:
Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990

Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.

Reviewed By: ltamasi

Differential Revision: D22072755

fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
2020-06-17 14:30:54 -07:00
Zitan Chen
94d04529de Store DB identity and DB session ID in SST files (#6983)
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.

The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.

In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.

A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983

Test Plan: make check and some manual tests.

Reviewed By: zhichao-cao

Differential Revision: D22048826

Pulled By: gg814

fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
2020-06-17 10:57:40 -07:00
Andrew Kryczka
742b452863 update minor version for 6.11 release (#6994)
Summary:
The 6.11.fb branch is already cut so I will also backport this PR to
that branch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6994

Reviewed By: riversand963

Differential Revision: D22084532

Pulled By: ajkr

fbshipit-source-id: 0b025f738cc31c65c673cbf89302359e88a34d19
2020-06-16 21:46:05 -07:00
Yanqin Jin
b7bab48099 Fix a bug of overwriting return code (#6989)
Summary:
In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6989

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22071418

Pulled By: riversand963

fbshipit-source-id: 5a4ea5dfb1a41f41c7a3fdaf62b163007b42f04b
2020-06-16 12:59:35 -07:00
Yanqin Jin
9bfd46d0d8 Let best-efforts recovery ignore CURRENT file (#6970)
Summary:
Best-efforts recovery does not check the content of CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on CURRENT file at all.

Test plan (dev server):
make check
./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970

Reviewed By: anand1976

Differential Revision: D22013990

Pulled By: riversand963

fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f
2020-06-15 14:11:24 -07:00
Levi Tamasi
aa8f1331af Fix uninitialized memory read in table_test (#6980)
Summary:
When using parameterized tests, `gtest` sometimes prints the test
parameters. If no other printing method is available, it essentially
produces a hex dump of the object. This can cause issues with valgrind
with types like `TestArgs` in `table_test`, where the object layout has
gaps (with uninitialized contents) due to the members' alignment
requirements. The patch fixes the uninitialized reads by providing an
`operator<<` for `TestArgs` and also makes sure all members are
initialized (in a consistent order) on all code paths.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6980

Test Plan: `valgrind --leak-check=full ./table_test`

Reviewed By: siying

Differential Revision: D22045536

Pulled By: ltamasi

fbshipit-source-id: 6f5920ac28c712d0aa88162fffb80172ed769c32
2020-06-15 14:08:12 -07:00
Zitan Chen
88db97b06d Add a DB Session ID (#6959)
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959

Test Plan: Passed make check

Reviewed By: zhichao-cao

Differential Revision: D21951721

Pulled By: gg814

fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
2020-06-15 10:47:02 -07:00
Zhen Li
9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Cheng Chang
f7613e2a9e Make it able to lower cpu priority to specific level in threadpool (#6969)
Summary:
`Env::LowerThreadPoolCPUPriority` takes a new parameter `CpuPriority` to be able to lower to a specific priority such as `CpuPriority::kIdle`, previously, the priority is always lowered to `CpuPriority::kLow`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6969

Test Plan: unit test `EnvPosixTest::LowerThreadPoolCpuPriority` added to `env_test.cc`.

Reviewed By: siying

Differential Revision: D22011169

Pulled By: cheng-chang

fbshipit-source-id: 568878c24a924912e35cef00c552d4a63431cdf4
2020-06-13 13:25:20 -07:00
Yanqin Jin
15d9f28da5 Add stress test for best-efforts recovery (#6819)
Summary:
Add crash test for the case of best-efforts recovery.
After a certain amount of time, we kill the db_stress process, randomly delete some certain table files and restart db_stress. Given the randomness of file deletion, it is difficult to verify against a reference for data correctness. Therefore, we just check that the db can restart successfully.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6819

Test Plan:
```
./db_stress -best_efforts_recovery=true -disable_wal=1 -reopen=0
./db_stress -best_efforts_recovery=true -disable_wal=0 -skip_verifydb=1 -verify_db_one_in=0 -continuous_verification_interval=0
make crash_test_with_best_efforts_recovery
```

Reviewed By: anand1976

Differential Revision: D21436753

Pulled By: riversand963

fbshipit-source-id: 0b3605c922a16c37ed17d5ab6682ca4240e47926
2020-06-12 19:27:46 -07:00
Levi Tamasi
bacd6edcbe Turn HarnessTest in table_test into a parameterized test (#6974)
Summary:
`HarnessTest` in `table_test.cc` currently tests many parameter
combinations sequentially in a loop. This is problematic from
a testing perspective, since if the test fails, we have no way of
knowing how many/which combinations have failed. It can also cause timeouts on
our test system due to the sheer number of combinations tested.
(Specifically, the parallel compression threads parameter added by
https://github.com/facebook/rocksdb/pull/6262 seems to have been the last straw.)
There is some DIY code there that splits the load among eight test cases
but that does not appear to be sufficient anymore.

Instead, the patch turns `HarnessTest` into a parameterized test, so all the
parameter combinations can be tested separately and potentially
concurrently. It also cleans up the tests a little, fixes
`RandomizedLongDB`, which did not get updated when the parallel
compression threads parameter was added, and turns `FooterTests` into a
standalone test case (since it does not actually need a fixture class).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6974

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D22029572

Pulled By: ltamasi

fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41
2020-06-12 17:31:06 -07:00
sdong
7e2ac0c3a0 Reduce test coverage in older VS versions (#6966)
Summary:
With Appveyor we run the same set of tests for older versions of VS as the latest version. It creates extra hanging which we don't plan to investigate. Instead, minimize tests run there. The full tests on Windows are already covered in CircleCI.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6966

Test Plan: Watch appveyor runs.

Reviewed By: pdillinger

Differential Revision: D22025383

fbshipit-source-id: 079dff9e8213bc750a47f4add90fdbf18de9d737
2020-06-12 17:05:47 -07:00
Zhen Li
d63f86e506 fix build with 'USE_HDFS' on windows (#6950)
Summary:
Build with "USE_HDFS" failed with below errors on Windows. This PR is trying to fix them
Severity	Code	Description	Project	File	Line	Suppression State
Error (active)	E0020	identifier "ssize_t" is undefined	rocksdb	D:\Git\rocksdb\rocksdb\env\env_hdfs.cc	127
Error (active)	E1696	cannot open source file "sys/time.h"	rocksdb	D:\Git\rocksdb\rocksdb\env\env_hdfs.cc	15
Error	C2065	'pthread_t': undeclared identifier	rocksdb	d:\git\rocksdb\rocksdb\hdfs\env_hdfs.h	166
Error	C3861	'pthread_self': identifier not found	rocksdb	d:\git\rocksdb\rocksdb\hdfs\env_hdfs.h	167
Error	C1083	Cannot open include file: 'sys/time.h': No such file or directory	rocksdb	d:\git\rocksdb\rocksdb\env\env_hdfs.cc	15
Error	C2065	'pthread_t': undeclared identifier	db_bench	d:\git\rocksdb\rocksdb\hdfs\env_hdfs.h	166
Error	C3861	'pthread_self': identifier not found	db_bench	d:\git\rocksdb\rocksdb\hdfs\env_hdfs.h	167
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6950

Test Plan:
1. manually test build with "USE_HDFS" on Windows, verified HDFS Env related function by db_bench.exe.
D:\Git\rocksdb\build\Debug>db_bench.exe --hdfs="abfs://test@rdbtest2.dfs.core.windows.net" --num=100 --benchmarks="fillseq,readseq,fillseekseq" --db="abfs://test@rdbtest2.dfs.core.windows.net/test"
2020-06-05 20:42:21,102 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-05 20:42:22,646 WARN utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.10
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    100
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    0.0 MB (estimated)
FileSize:   0.0 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [abfs://test@rdbtest2.dfs.core.windows.net/test]
fillseq      :    1138.350 micros/op 877 ops/sec;    0.1 MB/s
DB path: [abfs://test@rdbtest2.dfs.core.windows.net/test]
readseq      :      63.580 micros/op 15627 ops/sec;    1.7 MB/s
DB path: [abfs://test@rdbtest2.dfs.core.windows.net/test]
fillseekseq  :      45.615 micros/op 21762 ops/sec;

Reviewed By: cheng-chang

Differential Revision: D21964806

Pulled By: riversand963

fbshipit-source-id: 9d7413178ece0113d11bc4398583f7d0590d5dbd
2020-06-12 16:21:50 -07:00
sdong
9810f40075 Circle CI's clang build to really use clang (#6965)
Summary:
The CircleCI's Clang flavor has a bug that doesn't really use CLANG. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6965

Test Plan: See CI results.

Reviewed By: pdillinger

Differential Revision: D22025355

fbshipit-source-id: e86922b9152e9f5732e5099d0ce41da9226ff806
2020-06-12 15:50:34 -07:00
Andrew Kryczka
af58d92760 update HISTORY.md for 6.11 release (#6972)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6972

Reviewed By: zhichao-cao

Differential Revision: D22021953

Pulled By: ajkr

fbshipit-source-id: 4debbafe45b5939fd28549230eebf6006eb43440
2020-06-12 11:15:06 -07:00
Levi Tamasi
83833637c1 Maintain the set of linked SSTs in BlobFileMetaData (#6945)
Summary:
The `FileMetaData` objects associated with table files already contain the
number of the oldest blob file referenced by the SST in question. This patch
adds the inverse mapping to `BlobFileMetaData`, namely the set of table file
numbers for which the oldest blob file link points to the given blob file (these
are referred to as *linked SSTs*). This mapping will be used by the GC logic.

Implementation-wise, the patch builds on the `BlobFileMetaDataDelta`
functionality introduced in https://github.com/facebook/rocksdb/pull/6835: newly linked/unlinked SSTs are
accumulated in `BlobFileMetaDataDelta`, and the changes to the linked SST set
are applied in one shot when the new `Version` is saved. The patch also reworks
the blob file related consistency checks in `VersionBuilder` so they validate the
consistency of the forward table file -> blob file links and the backward blob file ->
table file links for blob files that are part of the `Version`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6945

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21912228

Pulled By: ltamasi

fbshipit-source-id: c5bc7acf6e729a8fccbb12672dd5cd00f6f000f8
2020-06-12 09:54:39 -07:00
Yanqin Jin
717749f4c0 Fail point-in-time WAL recovery upon IOError reading WAL (#6963)
Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963

Reviewed By: ajkr

Differential Revision: D22011083

Pulled By: riversand963

fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
2020-06-11 18:42:10 -07:00
Levi Tamasi
d854abad78 Revisit the handling of the case when a file is re-added to the same level (#6939)
Summary:
https://github.com/facebook/rocksdb/pull/6901 subtly changed the handling of the corner case
when a table file is deleted from a level, then re-added to the same level. (Note: this
should be extremely rare; one scenario that comes to mind is a trivial move followed by
a call to `ReFitLevel` that moves the file back to the original level.) Before that change,
a new `FileMetaData` object was created as a result of this sequence; after the change,
the original `FileMetaData` was essentially resurrected (since the deletion and the addition
simply cancel each other out with the change). This patch restores the original behavior,
which is more intuitive considering the interface, and in sync with how trivial moves are handled.
(Also note that `FileMetaData` contains some mutable data members, the values of which
might be different in the resurrected object and the freshly created one.)
The PR also fixes a bug in this area: with the original pre-6901 code, `VersionBuilder`
would add the same file twice to the same level in the scenario described above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6939

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D21905580

Pulled By: ltamasi

fbshipit-source-id: da07ae45384ecf3c6c53506d106432d88a7ec9df
2020-06-11 18:32:18 -07:00
Levi Tamasi
722ebba834 Turn DBTest2.CompressionFailures into a parameterized test (#6968)
Summary:
`DBTest2.CompressionFailures` currently tests many configurations
sequentially using nested loops, which often leads to timeouts
in our test system. The patch turns it into a parameterized test
instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D22006954

Pulled By: ltamasi

fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c
2020-06-11 16:35:39 -07:00
Zhichao Cao
b3585a11b4 Ingest SST files with checksum information (#6891)
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.

    1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
    2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891

Test Plan: added unit test, pass make asan_check

Reviewed By: pdillinger

Differential Revision: D21935988

Pulled By: zhichao-cao

fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
2020-06-11 14:27:36 -07:00
Levi Tamasi
fbe2d259cb Use a per-thread path for the export directory in import_column_family_test (#6962)
Summary:
This is required so that the test cases can safely be run in parallel.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6962

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D21980060

Pulled By: ltamasi

fbshipit-source-id: 616b7a0b686155d3874848b9098c67ad3f47efcc
2020-06-10 14:04:07 -07:00
Andrew Kryczka
e6be168aa5 save a key comparison in block seeks (#6646)
Summary:
This saves up to two key comparisons in block seeks. The first key
comparison saved is a redundant key comparison against the restart key
where the linear scan starts. This comparison is saved in all cases
except when the found key is in the first restart interval. The
second key comparison saved is a redundant key comparison against the
restart key where the linear scan ends. This is only saved in cases
where all keys in the restart interval are less than the target
(probability roughly `1/restart_interval`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646

Test Plan:
ran a benchmark with mostly default settings and counted key comparisons

before: `user_key_comparison_count = 19399529`
after: `user_key_comparison_count = 18431498`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: pdillinger

Differential Revision: D20849707

Pulled By: ajkr

fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
2020-06-10 13:58:39 -07:00
Andrew Kryczka
02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
Levi Tamasi
5abda3bb8b Move blob_log_{format,reader,writer}.{cc,h} to db/blob/ (#6960)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6960

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D21958416

Pulled By: ltamasi

fbshipit-source-id: 97cf05027b7363014b07836e7f158c23827bd661
2020-06-09 15:16:05 -07:00
Peter Dillinger
edf74d1cb1 Add --version and --help to ldb and sst_dump (#6951)
Summary:
as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6951

Test Plan: tests included + manual

Reviewed By: ajkr

Differential Revision: D21918540

Pulled By: pdillinger

fbshipit-source-id: 79d4991f2a831214fc7e477a839ec19dbbace6c5
2020-06-09 10:04:01 -07:00
sdong
6a8ddd374d Introduce some Linux build to CircleCI (#6937)
Summary:
Moving towards the long term goal of moving most CI build to CircleCI when possible, add some Linux tests in CircleCI. This is not all what we can include to CircleCI. For example, Java builds are not includ
ed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6937

Test Plan: Watch CI build results.

Reviewed By: pdillinger

Differential Revision: D21941605

fbshipit-source-id: db6aead3c45f523386d4fb30d224cfde573cccad
2020-06-08 19:34:31 -07:00
anand76
1fb3593f25 Fix a bug in looking up duplicate keys with MultiGet (#6953)
Summary:
When MultiGet is called with duplicate keys, and the key matches the
largest key in an SST file and the value type is merge, only the first
instance of the duplicate key is returned with correct results. This is
due to the incorrect assumption that if a key in a batch is equal to the
largest key in the file, the next key cannot be present in that file.

Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953

Reviewed By: cheng-chang

Differential Revision: D21935898

Pulled By: anand1976

fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
2020-06-08 16:11:21 -07:00
Levi Tamasi
f5e649453c Add convenience method GetFileMetaDataByNumber (#6940)
Summary:
The patch adds a convenience method `GetFileMetaDataByNumber` that
builds on the `FileLocation` functionality introduced recently (see
https://github.com/facebook/rocksdb/pull/6862). This method makes it possible to
retrieve the `FileMetaData` directly as opposed to having to go through
`LevelFiles` and friends.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6940

Test Plan: `make check`

Reviewed By: cheng-chang

Differential Revision: D21905946

Pulled By: ltamasi

fbshipit-source-id: af99e19de21242b2b4a87594a535c6028d16ee72
2020-06-08 16:01:59 -07:00
Zitan Chen
119b26fac0 Implement a new subcommand "identify" for sst_dump (#6943)
Summary:
Implemented a subcommand of sst_dump called identify, which determines whether a file is an SST file or identifies and lists all the SST files in a directory;

This update also fixes the problem that sst_dump exits with a success state even if target file/directory does not exist/is not an SST file/is empty/is corrupted.

One test is added to sst_dump_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6943

Test Plan: Passed make check and a few manual tests

Reviewed By: pdillinger

Differential Revision: D21928985

Pulled By: gg814

fbshipit-source-id: 9a8b48e0cf1a0e96b13f42b690aba8ad981afad3
2020-06-08 13:58:28 -07:00