Commit Graph

4385 Commits

Author SHA1 Message Date
Levi Tamasi
8a06fe278f Do not use ASSERT_OK in child threads in ExternalSstFileTest.PickedLevelBug (#7754)
Summary:
`googletest` uses exceptions to communicate assertion failures when
`GTEST_THROW_ON_FAILURE` is set, which does not go well with
`std::thread`s, since an exception escaping the top-level function of an
`std::thread` object or an `std::thread` getting destroyed without
having been `join`ed or `detach`ed first results in a call to
`std::terminate`. The patch fixes this by moving the `Status` assertions
of background operations in `ExternalSstFileTest.PickedLevelBug` to the
main thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7754

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25383808

Pulled By: ltamasi

fbshipit-source-id: 32fb2721e5169ec898d218900bc0d83eead45d03
2020-12-07 17:37:17 -08:00
Akanksha Mahajan
20c7d7c58a Handling misuse of snprintf return value (#7686)
Summary:
Handle misuse of snprintf return value to avoid Out of bound
read/write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D25030831

Pulled By: akankshamahajan15

fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
2020-12-07 13:43:55 -08:00
Akanksha Mahajan
1df8584896 Fix unit test failure ppc64le in travis (#7752)
Summary:
Added a fix for the failure of
DBTest2.PartitionedIndexUserToInternalKey on ppc64le in travis
Closes https://github.com/facebook/rocksdb/issues/7746

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7752

Test Plan:
Ran travis job multiple times and it passed. Will keep
watching the travis job after this patch.

Reviewed By: pdillinger

Differential Revision: D25373130

Pulled By: akankshamahajan15

fbshipit-source-id: fa0e3f85f75b687415044a506e42cc38ead87975
2020-12-07 10:24:33 -08:00
Yanqin Jin
eee0af9af1 Add full_history_ts_low to column family (#7740)
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.

`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:

>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>

During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D25296217

Pulled By: riversand963

fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
2020-12-05 14:18:22 -08:00
Levi Tamasi
61932cdf1d Add blob support to DBIter (#7731)
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)

In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)

TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25256293

Pulled By: ltamasi

fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
2020-12-04 21:29:38 -08:00
Zhichao Cao
e102de7318 Fix assert(cfd->imm()->NumNotFlushed() > 0) in FlushMemtable (#7744)
Summary:
In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.

Remove the assert and added the comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744

Test Plan: make check and pass the stress test

Reviewed By: riversand963

Differential Revision: D25340573

Pulled By: zhichao-cao

fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
2020-12-04 20:31:39 -08:00
Zhichao Cao
eb5a8c06dd Fix the thread wait case in error_handler (#7700)
Summary:
In error_handler auto recovery case, if recovery_in_prog_ is false, the recover is finished or failed. In this case, the auto recovery thread should finish its execution so recovery_thread_ should be null. However, in some cases, it is not null, the caller should not directly returned. Instead, it should wait for a while and create a new thread to execute the new recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7700

Test Plan: make check, error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D25098233

Pulled By: zhichao-cao

fbshipit-source-id: 5a1cba234ca18f6dd5d1be88e02d66e1d5ce931b
2020-12-04 14:58:37 -08:00
Cheng Chang
70f2e0916a Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570)
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570

Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.

Reviewed By: riversand963

Differential Revision: D24394222

Pulled By: cheng-chang

fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
2020-12-03 19:22:24 -08:00
Zhichao Cao
29e8f6a698 Add kManifestWriteNoWAL to BackgroundErrorReason to handle Flush IO Error when WAL is disabled (#7693)
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693

Test Plan: make check, add new testing cases to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D25066204

Pulled By: zhichao-cao

fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
2020-12-02 18:24:01 -08:00
Jay Zhuang
7fec715db4 Make CompactRange and GetApproximateSizes work with timestamp (#7684)
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D25015421

Pulled By: jay-zhuang

fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
2020-12-02 13:00:53 -08:00
Yanqin Jin
e062a719cc Fix assertion failure in bg flush (#7362)
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.

Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.

```
  assert(cfd->imm()->NumNotFlushed() != 0);
  assert(cfd->imm()->IsFlushPending());
```

Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362

Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```

Reviewed By: ajkr

Differential Revision: D25172996

Pulled By: riversand963

fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
2020-12-02 09:31:14 -08:00
Jay Zhuang
9e1640403a Exclude timestamp from prefix extractor (#7668)
Summary:
Timestamp should not be included in prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24966265

Pulled By: jay-zhuang

fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea
2020-12-01 14:07:15 -08:00
Andrew Kryczka
eb65d673fe Fix kPointInTimeRecovery handling of truncated WAL (#7701)
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).

While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701

Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.

Reviewed By: anand1976

Differential Revision: D25111765

Pulled By: ajkr

fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
2020-11-30 18:11:38 -08:00
Levi Tamasi
51a8dc6d14 Integrated blob garbage collection: relocate blobs (#7694)
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.

Some TODOs that I plan to address in separate PRs:

1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25069663

Pulled By: ltamasi

fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a
2020-11-23 21:08:22 -08:00
Andrew Kryczka
dd6b7fc520 Return Status from MemTable mutation functions (#7656)
Summary:
This PR updates `MemTable::Add()`, `MemTable::Update()`, and
`MemTable::UpdateCallback()` to return `Status` objects, and adapts the
client code in `MemTableInserter`. The goal is to prepare these
functions for key-value checksum, where we want to verify key-value
integrity while adding to memtable. After this PR, the memtable mutation
functions can report a failed integrity check by returning `Status::Corruption`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7656

Reviewed By: riversand963

Differential Revision: D24900497

Pulled By: ajkr

fbshipit-source-id: 1a7e80581e3774676f2bbba2f0a0b04890f40009
2020-11-23 16:29:04 -08:00
Yanqin Jin
1a5fc4f577 Port corruption test to use custom env (#7699)
Summary:
Allow corruption_test to run on custom env loaded via
`Env::LoadEnv()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7699

Test Plan:
```
make corruption_test
./corruption_test
```

Also run on in-house custom env.

Reviewed By: zhichao-cao

Differential Revision: D25135525

Pulled By: riversand963

fbshipit-source-id: 7941e7ce342dc88ec2cd63e90f7674a2f57de6b7
2020-11-20 18:40:24 -08:00
Cheng Chang
7169ca9c80 Do not track empty WALs (#7697)
Summary:
An empty WAL won't be backed up by the BackupEngine. So if we track the empty WALs in MANIFEST, then when restoring from a backup, it may report corruption that the empty WAL is missing, which is correct because the WAL is actually in the main DB but not in the backup DB, but missing an empty WAL does not logically break DB consistency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7697

Test Plan: watch existing tests to pass

Reviewed By: pdillinger

Differential Revision: D25077194

Pulled By: cheng-chang

fbshipit-source-id: 01917b57234b92b6063925f2ee9452c5732bdc03
2020-11-18 21:27:54 -08:00
Cheng Chang
8c93b16f02 Track WAL in MANIFEST: Update logic for computing min_log_number_to_keep in atomic flush (#7660)
Summary:
The logic for computing min_log_number_to_keep in atomic flush was incorrect.

For example, when all column families are flushed, the min_log_number_to_keep should be the latest new log. But the incorrect logic calls `PrecomputeMinLogNumberToKeepNon2PC` for each column family, and returns the minimum of them. However, `PrecomputeMinLogNumberToKeepNon2PC(cf)` assumes column families other than `cf` are flushed, but in case all column families are flushed, this assumption is incorrect.

Without this fix, the WAL referenced by the computed min_log_number_to_keep may actually contain no unflushed data, so the WAL might have actually been deleted from disk on recovery, then an incorrect error `Corruption: missing WAL` will be reported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7660

Test Plan:
run `make crash_test_with_atomic_flush`  on devserver
added a unit test in `db_flush_test`

Reviewed By: riversand963

Differential Revision: D24906265

Pulled By: cheng-chang

fbshipit-source-id: 08deda62e71f67f59e3b7925cdd86dd09bd4f430
2020-11-17 15:55:55 -08:00
Yanqin Jin
869f0538dd Clean up after two test failures in db_basic_test (#7682)
Summary:
In db_basic_test.cc, there are two tests that rely on the underlying
system's `LockFile` support to function correctly:
DBBasicTest.OpenWhenOpen and DBBasicTest.CheckLock. In both tests,
re-opening a db using `DB::Open` is expected to fail because the second
open cannot lock the LOCK file. Some distributed file systems, e.g. HDFS
do not support the POSIX-style file lock. Therefore, these unit tests will cause
assertion failure and the second `Open` will create a db instance.
Currently, these db instances are not closed after the assertion
failure. Since these db instances are registered with some process-wide, static
data structures, e.g. `PeriodicWorkScheduler::Default()`, they can still be
accessed after the unit tests. However, the `Env` object created for this db
instance is destroyed when the test finishes in `~DBTestBase()`. Consequently,
it causes illegal memory access.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7682

Test Plan:
Run the following on a distrubited file system:
```
make check
```

Reviewed By: anand1976

Differential Revision: D25004215

Pulled By: riversand963

fbshipit-source-id: f4327d7716c0e72b13bb43737ec9a5d156da4d52
2020-11-16 22:09:01 -08:00
Andrew Kryczka
1c5f13f2a5 Fail early when merge_operator not configured (#7667)
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667

Reviewed By: riversand963

Differential Revision: D24933360

Pulled By: ajkr

fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
2020-11-16 20:39:01 -08:00
Cheng Chang
1aae41786a Do not track WAL in MANIFEST when fsync is disabled in a test (#7669)
Summary:
If fsync is disabled in a unit test, then do not track WAL in MANIFEST, because on DB recovery, the WAL might be missing because the directory is not fsynced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7669

Test Plan: Tests with fsync enabled should pass.

Reviewed By: riversand963

Differential Revision: D24941431

Pulled By: cheng-chang

fbshipit-source-id: ab3ff0f90769795cfb4e4d6dcf084ea5545d1975
2020-11-13 13:37:14 -08:00
Peter Dillinger
60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Yanqin Jin
76ef894f9f Add full_history_ts_low_ to FlushJob (#7655)
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.
This PR adds a data member `full_history_ts_low_` of type `std::string` to `FlushJob`, and
`full_history_ts_low_` does not change during flush. `FlushJob` will pass a pointer to this data member
to the `CompactionIterator` used during flush.

Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24933340

Pulled By: riversand963

fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3
2020-11-12 18:44:34 -08:00
Levi Tamasi
bb69b4ce7f Fix InternalStats::DumpCFStats (#7666)
Summary:
https://github.com/facebook/rocksdb/pull/7461 accidentally broke
`InternalStats::DumpCFStats` by making `DumpCFFileHistogram` overwrite
the output of `DumpCFStatsNoFileHistogram` instead of appending to it,
resulting in only the file histogram related information getting logged.
The patch fixes this by reverting to appending in `DumpCFFileHistogram`.

Fixes https://github.com/facebook/rocksdb/issues/7664 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7666

Test Plan: Ran `make check` and checked the info log of `db_bench`.

Reviewed By: riversand963

Differential Revision: D24929051

Pulled By: ltamasi

fbshipit-source-id: 636a3d5ebb5ce23de4f3fe4f03ad3f16cb2858f8
2020-11-12 17:33:04 -08:00
Yanqin Jin
cf9d8e45c0 Add full_history_ts_low_ to CompactionJob (#7657)
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.

This PR adds a data member `full_history_ts_low_` of type `std::string` to `CompactionJob`, and
`full_history_ts_low_` does not change during compaction. `CompactionJob` will pass a pointer to this
data member to the `CompactionIterator` used during compaction.

Also refactored compaction_job_test.cc to re-use some existing code, which is actually the majority of this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7657

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24913803

Pulled By: riversand963

fbshipit-source-id: 11ad5329ddac365667152e7b3b02f84182c0ca8e
2020-11-12 11:43:24 -08:00
Levi Tamasi
0dc437d65c Clean up CompactionProxy (#7662)
Summary:
`CompactionProxy` is currently both a concrete class used for actual `Compaction`s
and a base class that `FakeCompaction` (which is used in `compaction_iterator_test`)
is derived from. This is bad from an OO design standpoint, and also results in
`FakeCompaction` containing an (uninitialized and unused) `Compaction*` member.
The patch fixes this by making `CompactionProxy` a pure interface and introducing
a separate concrete class `RealCompaction` for non-test/non-fake compactions. It
also removes an unused parameter from the virtual method `level`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7662

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24907680

Pulled By: ltamasi

fbshipit-source-id: c100ecb1beef4b0ada35e799116c5bda71719ee7
2020-11-12 08:49:35 -08:00
Andrew Kryczka
ec346da98c Always apply bottommost_compression_opts when enabled (#7633)
Summary:
Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when
`bottommost_compression` was also set to something other than `kDisableCompressionOption`.
This wasn't documented and, if we kept the old behavior, it'd make
things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can
simplify the API by making `bottommost_compression_opts` always take
effect when its `enabled` flag is set.

Fixes https://github.com/facebook/rocksdb/issues/7631.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633

Reviewed By: ltamasi

Differential Revision: D24710358

Pulled By: ajkr

fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647
2020-11-11 20:32:28 -08:00
Yanqin Jin
8b6b6aeb1a Refactor with VersionEditHandler (#6581)
Summary:
Added a few classes in the same class hierarchy to remove code duplication and
refactor the logic of reading and processing MANIFEST files.

New classes are as follows.
```
class VersionEditHandlerBase;
class ListColumnFamiliesHandler : VersionEditHandlerBase;
class FileChecksumRetriever : VersionEditHandlerBase;
class DumpManifestHandler : VersionEditHandler;
```
Classes that already existed before this PR are as follows.
```
class VersionEditHandler : VersionEditHandlerBase;
```

With these classes, refactored functions: `VersionSet::Recover()`,
`VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`,
`GetFileChecksumFromManifest()`.

Test Plan (devserver):
```
make check
COMPILE_WITH_ASAN=1 make check
```
These refactored code, especially recovery-related logic, will be tested intensively by
all existing unit tests and stress tests. For example, run
```
make crash_test
```
Verified 3 successful runs on devserver.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581

Reviewed By: ajkr

Differential Revision: D20616217

Pulled By: riversand963

fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b
2020-11-11 08:00:14 -08:00
Peter Dillinger
c57f914482 Use NPHash64 in more places (#7632)
Summary:
Since the hashes should not be persisted in output_validator
nor mock_env.

Also updated NPHash64 to use 64-bit seed, and comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7632

Test Plan:
make check, and new build setting that enables modification
to NPHash64, to check for behavior depending on specific values. Added
that setting to one of the CircleCI configurations.

Reviewed By: jay-zhuang

Differential Revision: D24833780

Pulled By: pdillinger

fbshipit-source-id: 02a57652ccf1ac105fbca79e77875bb7bf7c071f
2020-11-10 23:42:13 -08:00
Yanqin Jin
bcba372352 Report if unpinnable value encountered during backward iteration (#7618)
Summary:
There is an undocumented behavior about a certain combination of options and operations.
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.

We should stop the backward iteration and report an error of `Status::NotSupported`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24769619

Pulled By: riversand963

fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
2020-11-10 17:17:39 -08:00
Jay Zhuang
18aee7db7e Fix a seek issue with prefix extractor and timestamp (#7644)
Summary:
During seek, prefix compare should not include timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7644

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24772066

Pulled By: jay-zhuang

fbshipit-source-id: 3982655a8bf8da256a738e8497b73b3d9bdac92e
2020-11-10 14:53:13 -08:00
Yanqin Jin
9f1c84ca47 Fix a bug in compaction iterator with timestamp (#7645)
Summary:
https://github.com/facebook/rocksdb/issues/7556 introduced support for compaction iterator to perform timestamp-aware garbage collection.
However, there was a bug. The comparison between `ikey_.user_key` and `current_user_key_` should happen
before `key_ = current_key_.SetInternalKey(key_, &ikey_);` (line 336 of compaction_iterator.cc).
Otherwise, after this line, `current_key_` is always the same as `ikey_.user_key`.

This PR also re-arranged the order of some data members because some of them are state variables of `CompactionIterator` while others are inputs from callers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7645

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24845028

Pulled By: riversand963

fbshipit-source-id: c7e79914832701462b86867e8463cd463b6c0c25
2020-11-09 18:23:31 -08:00
Cheng Chang
c3911f1a72 Track WAL in MANIFEST: Track deleted WALs in MANIFEST after recovering from the WALs (#7649)
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.

After this change, we can enable `track_and_verify_wal_in_manifest` in `db_stress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649

Test Plan: `python tools/db_crashtest.py whitebox`

Reviewed By: riversand963

Differential Revision: D24824501

Pulled By: cheng-chang

fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
2020-11-09 10:25:43 -08:00
Cheng Chang
5e794b0841 Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:

1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.

It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.

After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621

Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.

Reviewed By: riversand963

Differential Revision: D24668144

Pulled By: cheng-chang

fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-07 22:23:27 -08:00
Cheng Chang
1e40696dd1 Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601)
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601

Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.

Reviewed By: ltamasi

Differential Revision: D24553957

Pulled By: cheng-chang

fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
2020-11-06 17:22:36 -08:00
Cheng Chang
1ce105d0ea Disable fsync in DBMergeOperatorTest to save test time (#7640)
Summary:
The test often times out in internal test infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7640

Test Plan: watch test to pass internally

Reviewed By: anand1976

Differential Revision: D24764928

Pulled By: cheng-chang

fbshipit-source-id: 587f2afc97f52909837943fd938a86ca94544b2c
2020-11-06 15:24:17 -08:00
Cheng Chang
cdc7ba3a32 DBTablePropertiesTest often times out in internal test infra (#7639)
Summary:
In this test, after flushing memtable, it will read directly from the sst files, so `env_do_fsync` was `true` to ensure that the flushed sst files can be read afterwards. Considering that the test does not last long, the data should be available in os buffer even without fsync, so this PR tries to disable fsync to reduce test time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7639

Test Plan: watch the test to pass in internal infra

Reviewed By: anand1976

Differential Revision: D24764689

Pulled By: cheng-chang

fbshipit-source-id: ef827611a3eaca04201e4280ae801d6c8e60c138
2020-11-06 14:25:14 -08:00
Cheng Chang
4c2aef04bd ColumnFamilyTest often times out in internal test infra (#7638)
Summary:
Tries to fix by skipping fsync.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7638

Test Plan: watch the tests to pass

Reviewed By: jay-zhuang

Differential Revision: D24764355

Pulled By: cheng-chang

fbshipit-source-id: 9c21b177709025ca1943066d94da89324ed47655
2020-11-06 10:25:20 -08:00
Cheng Chang
81543369e5 Disable fsync in db_range_del_test (#7637)
Summary:
This test often times out in internal test infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7637

Test Plan: watch test to pass

Reviewed By: ajkr

Differential Revision: D24763939

Pulled By: cheng-chang

fbshipit-source-id: 6564ee2ef637e9faf6688d4b6a5d74a72a51c5e8
2020-11-06 10:25:20 -08:00
Yanqin Jin
b6d8e36741 Compute NeedCompact() after table builder Finish() (#7627)
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`.

Test plan (on devserver):
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627

Reviewed By: ajkr

Differential Revision: D24728741

Pulled By: riversand963

fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
2020-11-04 10:44:56 -08:00
Yanqin Jin
fde0cd7ced Add API to verify whole sst file checksum (#7578)
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.

```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24436783

Pulled By: riversand963

fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
2020-11-03 20:34:56 -08:00
Yanqin Jin
0b94468bba Avoid skipping a test in db_wal_test (#7628)
Summary:
Recent test report shows that some tests have been skipped.

For DBWALTest that inherits from DBTestBase, the following will always be
true, since `env_` is an instance of `SpecialEnv`, not `Env::Default()`. Thus the test
will always be skipped.

```
if (options.env != Env::Default()) {
  ROCKSDB_GTEST_SKIP("Test requires default environment");
  return;
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7628

Test Plan:
./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
MEM_ENV=1 ./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
make check

Reviewed By: jay-zhuang

Differential Revision: D24693006

Pulled By: riversand963

fbshipit-source-id: 7f2a772492a0f11bff17bbf5e9f493e9e9a1c125
2020-11-03 09:48:16 -08:00
Jay Zhuang
881e0dcc09 Fix MultiGet unable to query timestamp data issue (#7589)
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589

Reviewed By: riversand963

Differential Revision: D24494661

Pulled By: jay-zhuang

fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
2020-11-03 09:45:41 -08:00
Yanqin Jin
c992eb118b Avoid skipping a test in db_test2 (#7629)
Summary:
Test report shows that this test has been skipped recently due to
a condition that will never meet. `env_` is not equal to
`Env::Default()` for DBTest2 that inherits from DBTestBase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7629

Test Plan:
make check
./db_test2 --gtest_filter=DBTest2.PinnableSliceAndMmapReads

Reviewed By: jay-zhuang

Differential Revision: D24693317

Pulled By: riversand963

fbshipit-source-id: b1bbd5c1e05a6fa57c1de0d74462b69e3c2d5215
2020-11-02 19:48:23 -08:00
Andrew Kryczka
1adbceb581 Expand effect of dictionary settings in ColumnFamilyOptions::compression_opts (#7619)
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.

To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619

Reviewed By: ltamasi

Differential Revision: D24665610

Pulled By: ajkr

fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
2020-11-02 19:21:11 -08:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
Zhichao Cao
ea347d80df Updated GenerateOneFileChecksum to use requested_checksum_func_name (#7586)
Summary:
CreateFileChecksumGenerator may uses requested_checksum_func_name in generator context to decide which generator will be used. GenerateOneFileChecksum has not being updated to use it, which will always get the generator when the name is empty. Fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7586

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24491989

Pulled By: zhichao-cao

fbshipit-source-id: d9fdfdd431240f0a9a2e781ddbd48a7d6c609aad
2020-10-28 16:47:12 -07:00
darionyaphet
793e9b7f5b Remove duplicate close (#7594)
Summary:
Because `Close()` have called in `Destroy()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7594

Reviewed By: zhichao-cao

Differential Revision: D24576407

Pulled By: riversand963

fbshipit-source-id: eba70d73375fd47dd78ca64c6a1fab3628448276
2020-10-28 10:48:53 -07:00
Ramkumar Vadivelu
9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`

Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
mrambacher
f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Yanqin Jin
6134ce6444 Perform post-flush updates of memtable list in a callback (#6069)
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.

```
  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V
```

The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069

Reviewed By: cheng-chang

Differential Revision: D18790894

Pulled By: riversand963

fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
2020-10-26 18:23:01 -07:00
Levi Tamasi
a7a04b6898 Integrate BlobFileBuilder into the compaction process (#7573)
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.

There will be a separate follow-up patch to perform blob garbage collection
during compactions.

In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24404696

Pulled By: ltamasi

fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
2020-10-26 13:51:55 -07:00
Yanqin Jin
6595267980 Allow compaction iterator to perform garbage collection (#7556)
Summary:
Add a threshold timestamp, full_history_ts_low_ of type `std::string*` to
`CompactionIterator`, so that RocksDB can also perform garbage collection during
compaction.
* If full_history_ts_low_ is nullptr, then compaction iterator does not perform
  GC, preserving all timestamp history for all keys. Compaction iterator will
treat user key with different timestamps as different user keys.
* If full_history_ts_low_ is not nullptr, then compaction iterator performs
  GC. GC will look at keys older than `*full_history_ts_low_` and determine their
  eligibility based on factors including snapshots.

Current rules of GC:
 * If an internal key is in the same snapshot as a previous counterpart
    with the same user key, and this key is eligible for GC, and the key is
    not single-delete or merge operand, then this key can be dropped. Note
    that the previous internal key cannot be a merge operand either.
 * If a tombstone is the most recent one in the earliest snapshot and it
    is eligible for GC, and keyNotExistsBeyondLevel() is true, then this
    tombstone can be dropped.
 * If a tombstone is the most recent one in a snapshot and it is eligible
    for GC, and the compaction is at bottommost level, then all other older
    internal keys of the same user key must also be eligible for GC, thus
    can be dropped
* Single-delete, delete-range and merge are not currently supported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7556

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24507728

Pulled By: riversand963

fbshipit-source-id: 3c09c7301f41eed76dfcf4d1527e68cf6e0a8bb3
2020-10-23 22:59:46 -07:00
Cheng Chang
1b224324b5 Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST (#7256)
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.

The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256

Test Plan: a set of unit tests are added in `version_set_test.cc`

Reviewed By: riversand963

Differential Revision: D23123238

Pulled By: cheng-chang

fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
2020-10-23 22:49:51 -07:00
Zhichao Cao
d8ec0a760a Make FileType Public and Replace kLogFile with kWalFile (#7580)
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24485420

Pulled By: zhichao-cao

fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
2020-10-22 17:06:20 -07:00
Akanksha Mahajan
eef27d0048 Bug fix to remove function calling in assert statement (#7581)
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D24466420

Pulled By: akankshamahajan15

fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
2020-10-21 20:18:06 -07:00
Cheng Chang
73dbe10bbf Fix write_batch_test when ASSERT_STATUS_CHECKED=1 (#7575)
Summary:
Without this PR, `ASSERT_STATUS_CHECKED=1 make   -j32 write_batch_test && ./write_batch_test` fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7575

Test Plan: ASSERT_STATUS_CHECKED=1 make   -j32 write_batch_test && ./write_batch_test

Reviewed By: zhichao-cao

Differential Revision: D24411442

Pulled By: cheng-chang

fbshipit-source-id: f67dc43c44d6afcc6d7e5ff15c6ae9bbf4dfc943
2020-10-20 13:18:41 -07:00
Cheng Chang
fc9b416013 Fix typo in db_wal_test (#7571)
Summary:
as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7571

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D24392577

Pulled By: cheng-chang

fbshipit-source-id: c94f92db48270d0e215aa0f2782b0ff2e31bf708
2020-10-20 11:50:30 -07:00
anand76
00751e4292 Add a host location property to TableProperties (#7479)
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479

Test Plan: Add new unit tests

Reviewed By: pdillinger

Differential Revision: D24340671

Pulled By: anand1976

fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
2020-10-19 11:38:48 -07:00
Stanislav Tkach
ed90e2a450 Add getters to the C API for env, universal compaction options and fifo compaction options (#7501)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7501

Reviewed By: ltamasi

Differential Revision: D24344109

Pulled By: pdillinger

fbshipit-source-id: d9a2b1b1cc8c8d8a96f13b8ae6814380caa10c96
2020-10-16 11:04:01 -07:00
Levi Tamasi
e8cb32ed67 Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.

TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24260219

Pulled By: ltamasi

fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
2020-10-15 13:04:47 -07:00
mrambacher
a8c89cc969 Test for LoadLatestOptions (#7554)
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist.  Added tests for the LoadOptions related methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554

Reviewed By: akankshamahajan15

Differential Revision: D24298985

Pulled By: zhichao-cao

fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
2020-10-14 22:28:55 -07:00
Stanislav Tkach
1a83f5a8ac Expose BackupableDBOptions in the C API (#7550)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7550

Reviewed By: jay-zhuang

Differential Revision: D24315343

Pulled By: ajkr

fbshipit-source-id: fc7855b630a50c00dcb940241942295932732f39
2020-10-14 17:51:47 -07:00
Zhichao Cao
b99fe1ab74 Remove the status.PermitUncheckedError() from WriteGroup Destructor (#7555)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7555

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 error_handler_fs_test

Reviewed By: riversand963

Differential Revision: D24299387

Pulled By: zhichao-cao

fbshipit-source-id: 6c8aa91c4b6e2bc82580b8d2264c177068f5a32c
2020-10-14 10:47:58 -07:00
Zhichao Cao
16bff5370d Add plain_table_db_test to ASSERT_STATUS_CHECKED list (#7482)
Summary:
Add plain_table_db_test to ASSERT_STATUS_CHECKED list

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7482

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 plain_table_db_test

Reviewed By: riversand963

Differential Revision: D24034987

Pulled By: zhichao-cao

fbshipit-source-id: e61c937d55ded0947cc8936937362dafed572a60
2020-10-13 12:00:09 -07:00
mrambacher
bf342394b6 Add tests for paranoid checks with range deletion (#7521)
Summary:
Added unit tests that have paranoid_check = true that perform range deletions.  At the moment, the deleted ranges do not appear to be checked as part of the paranoid checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7521

Reviewed By: zhichao-cao

Differential Revision: D24262175

Pulled By: ajkr

fbshipit-source-id: 1035e968f7ab8ccaa7af086b835a4e72c7e56743
2020-10-13 10:20:36 -07:00
Zhichao Cao
861e544335 Add db_flush_test to ASSERT_STATUS_CHECKED list (#7476)
Summary:
Added status check enforcement for db_flush_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7476

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 db_flush_test

Reviewed By: akankshamahajan15

Differential Revision: D24033752

Pulled By: zhichao-cao

fbshipit-source-id: d957934e1666d0043bebdd8a4149e94cdcbbb89b
2020-10-12 15:18:00 -07:00
Jay Zhuang
f548a2a03c Fix a flaky tsan test for DBTest2 (#7526)
Summary:
ThreadSanitizer: data race for `DummyOldStats.num_rt`.
Failed build: https://app.circleci.com/pipelines/github/facebook/rocksdb/3991/workflows/b47c3ae1-5531-4489-ac51-11854abdfd0f/jobs/42305

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7526

Reviewed By: akankshamahajan15

Differential Revision: D24226736

Pulled By: jay-zhuang

fbshipit-source-id: e05ce354d0c0db0eba242d59d4b0e89ce7c25acf
2020-10-12 11:22:25 -07:00
Andrew Kryczka
75d3b6fdf0 Redesign block cache pinning API (#7520)
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520

Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`

Reviewed By: pdillinger

Differential Revision: D24200034

Pulled By: ajkr

fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
2020-10-11 14:58:24 -07:00
Cheng Chang
12b78e40bd Track WAL in MANIFEST: add option track_and_verify_wals_in_manifest (#7275)
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275

Test Plan:
db_options_test
options_test

Reviewed By: pdillinger

Differential Revision: D23181418

Pulled By: cheng-chang

fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
2020-10-09 16:42:19 -07:00
Akanksha Mahajan
24498ab1ec Add few unit test cases in ASSERT_STATUS_CHECKED (#7500)
Summary:
Add status enforcement for following tests:
  1. import_column_family_test
  2. memory_test
  3. table_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7500

Reviewed By: zhichao-cao

Differential Revision: D24095887

Pulled By: akankshamahajan15

fbshipit-source-id: db8e1ec595852df143fad78a0c07bfdd27dc3c84
2020-10-08 11:22:44 -07:00
Levi Tamasi
810ab34ede Do not rely on the two-argument std::pair constructor being constexpr (#7519)
Summary:
The `std::pair(const T1& x, const T2& y);` constructor is `constexpr`
only starting from C++14; relying on this breaks compilation on certain
compilers/platforms we need to support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7519

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D24195747

Pulled By: ltamasi

fbshipit-source-id: 665e8fbc9747675bb49c5d895aad3dcf2714750f
2020-10-08 10:49:40 -07:00
Zhichao Cao
4146276885 Add ldb_cmd_test to ASSERT_STATUS_CHECKED list (#7499)
Summary:
Add ldb_cmd_test to ASSERT_STATUS_CHECKED list

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7499

Test Plan: pass ASSERT_STATUS_CHECKED=1 make -j48 ldb_cmd_test

Reviewed By: cheng-chang

Differential Revision: D24086203

Pulled By: zhichao-cao

fbshipit-source-id: 29592202b1d4335e566de15e7937269d98d57841
2020-10-08 00:00:48 -07:00
Yanqin Jin
002b30c967 Fix clang analyzer (#7518)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7518

Test Plan:
```
$USE_CLANG=1 make analyze
```

Reviewed By: zhichao-cao

Differential Revision: D24175390

Pulled By: riversand963

fbshipit-source-id: c70121652908cf5d450120c38ab65cc595332ca7
2020-10-07 20:11:06 -07:00
Levi Tamasi
1f84611e5d Clean up BlobLogReader and rename it to BlobLogSequentialReader (#7517)
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24172611

Pulled By: ltamasi

fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
2020-10-07 17:48:16 -07:00
Levi Tamasi
22655a398b Introduce a blob file reader class (#7461)
Summary:
The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
using the information available in blob references (e.g. blob file number, offset, and
size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
and iterators, and also for compaction/garbage collection.

When a `BlobFileReader` object is created (using the factory method `Create`),
it first checks whether the specified file is potentially valid by comparing the file
size against the combined size of the blob file header and footer (files smaller than
the threshold are considered malformed). Then, it opens the file, and reads and verifies
the header and footer. The verification involves magic number/CRC checks
as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
or TTL blob files.

Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
type passed by the caller (because of the presence of the header and footer, the
specified offset cannot be too close to the start/end of the file; also, the compression type
has to match the one in the blob file header), and retrieves and potentially verifies and
uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
`BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
and verifies the key/value size, the key itself, as well as the CRC of the blob record header
and the key/value pair.

In addition, the patch exposes the compression type from `BlobIndex` (both using an
accessor and via `DebugString`), and adds a blob file read latency histogram to
`InternalStats` that can be used with `BlobFileReader`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23999219

Pulled By: ltamasi

fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 15:44:53 -07:00
Akanksha Mahajan
38d0a365e3 Add Stats for MultiGet (#7366)
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
    1. Number of index and filter blocks read from file as part of MultiGet
    request per level.
    2. Number of data blocks read from file per level.
    3. Number of SST files loaded from file system per level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366

Reviewed By: anand1976

Differential Revision: D24127040

Pulled By: akankshamahajan15

fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
2020-10-07 13:28:48 -07:00
Jay Zhuang
8891e9a0eb Disallow trivial move if BottommostLevelCompaction is kForce* (#7368)
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368

Reviewed By: ajkr

Differential Revision: D23629525

Pulled By: jay-zhuang

fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
2020-10-07 13:19:31 -07:00
anand76
a242a58301 Enable ASSERT_STATUS_CHECKED for db_universal_compaction_test (#7460)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7460

Reviewed By: riversand963

Differential Revision: D24057636

Pulled By: anand1976

fbshipit-source-id: bfb13da6993a5e407be20073e4d6751dfb38e442
2020-10-06 14:42:12 -07:00
Yanqin Jin
1bcef3d83c Make sure assert(false) handles failures too (#7483)
Summary:
In opt mode, assertions are just no-ops. Therefore, we need to report errors instead of just doing an `assert(false)`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7483

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D24142725

Pulled By: riversand963

fbshipit-source-id: 5629556dbe29f00dd09e30a7d5df5e6cf09ee435
2020-10-06 13:52:42 -07:00
Jay Zhuang
53089038de Fix StallWrite crash with mixed of slowdown/no_slowdown writes (#7508)
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508

Test Plan: Add unittest to reproduce the issue.

Reviewed By: anand1976

Differential Revision: D24126601

Pulled By: jay-zhuang

fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
2020-10-06 12:44:20 -07:00
Yanqin Jin
758ead5df7 Enforce status check for corruption_test (#7453)
Summary:
Enforce status check for corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7453

Test Plan:
```
ASSERT_STATUS_CHECKED=1 make corruption_test
./corruption_test
```

Reviewed By: jay-zhuang

Differential Revision: D24006862

Pulled By: riversand963

fbshipit-source-id: 664677caf4c3007a25cf565cec3d677f2dcea130
2020-10-02 22:11:00 -07:00
sdong
668ee08915 Fix prefix_test for status check (#7495)
Summary:
Fix prefix_test so that it passes when ASSERT_STATUS_CHECKED=1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7495

Test Plan: Run the test with the option

Reviewed By: anand1976

Differential Revision: D24069715

fbshipit-source-id: 54f74b58575a1b49dbdee9ea2d24751fa956b620
2020-10-02 17:01:15 -07:00
Zhichao Cao
b7062f0b2c Status check enforcement for error_handler_fs_test (#7342)
Summary:
Added status check enforcement for error_test_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7342

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 error_test_fs_test

Reviewed By: akankshamahajan15

Differential Revision: D23972231

Pulled By: zhichao-cao

fbshipit-source-id: fa41bfe440012e0c55f2c9507c1d0104e5e93f84
2020-10-02 16:41:13 -07:00
Akanksha Mahajan
7cd760dfdf Add status check enforcement for column_family_test.cc (#7484)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7484

Reviewed By: jay-zhuang

Differential Revision: D24037616

Pulled By: akankshamahajan15

fbshipit-source-id: 0f63281f81046bcb1b95a7578783285cc6346ece
2020-10-02 13:35:15 -07:00
Yanqin Jin
48d5aa9bab Enable status check for db_secondary_test (#7487)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7487

Test Plan:
ASSERT_STATUS_CHECKED=1 make db_secondary_test
./db_secondary_test

Reviewed By: zhichao-cao

Differential Revision: D24071038

Pulled By: riversand963

fbshipit-source-id: e6600c0aecab71c1326b22af263e92bddee5f7ac
2020-10-02 11:23:36 -07:00
Andrew Kryczka
29ed766193 add Status check enforcement for stats_history_test (#7496)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7496

Reviewed By: zhichao-cao

Differential Revision: D24070007

Pulled By: ajkr

fbshipit-source-id: 4320413a4d7707774ee23a7e6232714d7ee7a57f
2020-10-02 08:25:30 -07:00
Andrew Kryczka
1e00909730 Periodically flush info log out of application buffer (#7488)
Summary:
This PR schedules a background thread (shared across all DB instances)
to flush info log every ten seconds. This improves debuggability in case
of RocksDB hanging since it ensures the log messages leading up to the hang
will eventually become visible in the log.

The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler*
and making the corresponding name changes since now the scheduler handles info
log flushing, not just stats dumping.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488

Reviewed By: riversand963

Differential Revision: D24065165

Pulled By: ajkr

fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5
2020-10-01 19:14:14 -07:00
sdong
94fc676d3f Fix db_properties_test for ASSERT_STATUS_CHECKED (#7490)
Summary:
Add all status handling in db_properties_test so that it can pass ASSERT_STATUS_CHECKED.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7490

Test Plan: Run the test with ASSERT_STATUS_CHECKED

Reviewed By: jay-zhuang

Differential Revision: D24065382

fbshipit-source-id: e008916155196891478c964df0226545308ca71d
2020-10-01 17:47:09 -07:00
Zhichao Cao
685cabdafa Add trace_analyzer_test to ASSERT_STATUS_CHECKED list (#7480)
Summary:
Add trace_analyzer_test to ASSERT_STATUS_CHECKED list

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7480

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 trace_analyzer_test

Reviewed By: riversand963

Differential Revision: D24033768

Pulled By: zhichao-cao

fbshipit-source-id: b415045e6fab01d6193448650772368c21c6dba6
2020-10-01 15:58:52 -07:00
Peter Dillinger
9082771b86 Add is_full_compaction to CompactionJobStats, cleanup (#7451)
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.

TODO later: update C and Java interfaces

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451

Test Plan: updated existing unit tests to check new field, make check

Reviewed By: ltamasi

Differential Revision: D23977796

Pulled By: pdillinger

fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
2020-10-01 12:52:58 -07:00
Levi Tamasi
786c1a2cc4 Reduce the number of iterations in DBTest.FileCreationRandomFailure (#7481)
Summary:
`DBTest.FileCreationRandomFailure` frequently times out during our
continuous test runs. (It's a case of "stress test posing as unit test.")
The patch reduces the number of iterations to avoid this. Note that
the lower numbers are still sufficient to trigger both flushes and
compactions, so test coverage is still the same.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7481

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24034712

Pulled By: ltamasi

fbshipit-source-id: 8731a9446e5a121a1041b00f0df473b9f714935a
2020-10-01 10:42:58 -07:00
sdong
7508175558 Introduce options.check_flush_compaction_key_order (#7467)
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467

Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.

Reviewed By: riversand963

Differential Revision: D24010683

fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
2020-10-01 10:10:26 -07:00
Koby Kahane
3e745053b7 Fix MSVC-related build issues (#7439)
Summary:
This PR addresses some build and functional issues on MSVC targets, as a step towards an eventual goal of having RocksDB build successfully for Windows on ARM64.

Addressed issues include:
- BitsSetToOne and CountTrailingZeroBits do not compile on non-x64 MSVC targets. A fallback implementation of BitsSetToOne when Intel intrinsics are not available is added, based on the C++20 `<bit>` popcount implementation in Microsoft's STL.
- The implementation of FloorLog2 for MSVC targets (including x64) gives incorrect results. The unit test easily detects this, but CircleCI is currently configured to only run a specific set of tests for Windows CMake builds, so this seems to have been unnoticed.
- AsmVolatilePause does not use YieldProcessor on Windows ARM64 targets, even though it is available.
- When CondVar::TimedWait calls Microsoft STL's condition_variable::wait_for, it can potentially trigger a bug (just recently fixed in the upcoming VS 16.8's STL) that deadlocks various tests that wait for a timer to execute, since `Timer::Run` doesn't get a chance to execute before being blocked by the test function acquiring the mutex.
- In c_test, `GetTempDir` assumes a POSIX-style temp path.
- `NormalizePath` did not eliminate consecutive POSIX-style path separators on Windows, resulting in test failures in e.g., wal_manager_test.
- Various other test failures.

In a followup PR I hope to modify CircleCI's config.yml to invoke all RocksDB unit tests in Windows CMake builds with CTest, instead of the current use of `run_ci_db_test.ps1` which requires individual tests to be specified and is missing many of the existing tests.

Notes from peterd: FloorLog2 is not yet used in production code (it's for something in progress). I also added a few more inexpensive platform-dependent tests to Windows CircleCI runs. And included facebook/folly#1461 as requested

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7439

Reviewed By: jay-zhuang

Differential Revision: D24021563

Pulled By: pdillinger

fbshipit-source-id: 0ec2027c0d6a494d8a0fe38d9667fc2f7e29f7e7
2020-10-01 09:23:04 -07:00
Ramkumar Vadivelu
e04a50923d Change ParseInternalKey() to return Status instead of bool (#7457)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7430

Change ParseInternalKey() to return Status instead of bool.

db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server):
https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0

![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457

Reviewed By: ajkr

Differential Revision: D24002433

Pulled By: ramvadiv

fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c
2020-09-30 19:16:47 -07:00
Andrew Kryczka
718e192965 Fix flaky intra-L0 consistency failure regression tests (#7477)
Summary:
Do not assert the number of files after intra-L0 compaction is eligible to run since it could complete (and reduce the number of files) before the assertion executes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7477

Reviewed By: pdillinger

Differential Revision: D24032049

Pulled By: ajkr

fbshipit-source-id: e838ac7a24651ebd643b9e5a9d39d2e789c46929
2020-09-30 16:50:24 -07:00
Peter Dillinger
ddbc5dad05 Enable force_consistency_checks by default (#7446)
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446

Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test

Reviewed By: ajkr

Differential Revision: D23972101

Pulled By: pdillinger

fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
2020-09-30 11:57:32 -07:00
sdong
5f33436285 Revert an uncessary status code check skipping (#7458)
Summary:
https://github.com/facebook/rocksdb/pull/7452 added an uncessary skip for status code checking. Revert it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7458

Test Plan: Watch CI to finish

Reviewed By: jay-zhuang

Differential Revision: D23994390

fbshipit-source-id: a2b50a6326d8073db3386bff3d32acc5a6666e9b
2020-09-30 11:38:39 -07:00
Akanksha Mahajan
9d212d3f0e Provide users with option to opt-in to get corrupt data in logs/messages (#7420)
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420

Test Plan:
1. make check -j64
           2. Add a new test case

Reviewed By: ajkr

Differential Revision: D23835028

Pulled By: akankshamahajan15

fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
2020-09-29 23:17:45 -07:00
Jay Zhuang
1bdaef7a06 Status check enforcement for timestamp_basic_test (#7454)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7454

Reviewed By: riversand963

Differential Revision: D23981719

Pulled By: jay-zhuang

fbshipit-source-id: 01073f73e54c17067b886c4a2f179b2804198399
2020-09-29 18:23:27 -07:00
Andrew Kryczka
8115eb520d add Status check assertions for repair_test (#7455)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7455

Reviewed By: riversand963

Differential Revision: D23985283

Pulled By: ajkr

fbshipit-source-id: 5dd2be62350f6e31d13a1e7821cb848a37699c93
2020-09-29 16:30:08 -07:00
Yanqin Jin
07dc955a1f Report error of GetChildren (#7459)
Summary:
As title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7459

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D23999393

Pulled By: riversand963

fbshipit-source-id: 09df8e1637f4df3616c63ee314de397b35be4e4a
2020-09-29 15:27:00 -07:00
anand76
12ede5ed7c Remove invalid assertion in compaction_picker_universal.cc (#7421)
Summary:
The assertion checks that there is no overlap in sequence numbers across levels in universal compaction. However, this assumption doesn't hold when there is a delete triggered compaction or a trivial move, as they operate on a subset of a level.

Tests -
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7421

Reviewed By: ajkr

Differential Revision: D23872672

Pulled By: anand1976

fbshipit-source-id: c386deab8e01a5746ca996ff1f4ebcae3b15b7d2
2020-09-29 13:29:58 -07:00
sdong
d08a9005b7 Make db_basic_test pass assert status checked (#7452)
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452

Test Plan: See CI tests pass.

Reviewed By: zhichao-cao

Differential Revision: D23979764

fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
2020-09-29 09:49:04 -07:00
Zhichao Cao
d71cfe04e4 Add flush_job_test to the list of ASSERT_STATUS_CHECKED tests (#7445)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7445

Test Plan: pass ASSERT_STATUS_CHECKED=1 make -j48 flush_job_test

Reviewed By: akankshamahajan15

Differential Revision: D23969372

Pulled By: zhichao-cao

fbshipit-source-id: 498ff45ef84e07ec27a8f35d0874d3371412afe9
2020-09-28 14:59:02 -07:00
Levi Tamasi
1abbc56aba Add version_builder_test to the list of ASSERT_STATUS_CHECKED tests (#7444)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7444

Test Plan: `ASSERT_STATUS_CHECKED=1 make version_builder_test -j24`

Reviewed By: jay-zhuang

Differential Revision: D23965793

Pulled By: ltamasi

fbshipit-source-id: 8beaf66548379f21146189cda699d5f6fbb35a1b
2020-09-28 12:12:40 -07:00
Ramkumar Vadivelu
c203e01773 reset refitting_level_ flag to false in error paths (#7403)
Summary:
Reset refitting_level_ flag to false in error paths in DBImpl::ReFitLevel()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7403

Reviewed By: ajkr

Differential Revision: D23909028

Pulled By: ramvadiv

fbshipit-source-id: 521ad9aadc1b734bef9ef9119d1e1ee1fa8126e9
2020-09-28 11:37:00 -07:00
Peter Dillinger
08552b19d3 Genericize and clean up FastRange (#7436)
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436

Test Plan: added a few more test cases

Reviewed By: jay-zhuang

Differential Revision: D23958153

Pulled By: pdillinger

fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
2020-09-28 11:35:00 -07:00
Jay Zhuang
fa92b9dc9f Fix TSAN build and re-enable the tests (#7386)
Summary:
Resolve TSAN build warnings and re-enable disabled TSAN tests.

Not sure if it's a compiler issue or TSAN check issue. Switching from
conditional operator to if-else mitigated the problem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7386

Test Plan:
run TSAN check 10 times in circleci.

```
WARNING: ThreadSanitizer: data race (pid=27735)
  Atomic write of size 8 at 0x7b54000005e8 by thread T32:
    #0 __tsan_atomic64_store <null> (db_test+0x4cee95)
    https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x78460e)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1058:20 (db_test+0x78460e)
...
  Previous read of size 8 at 0x7b54000005e8 by thread T31:
    #0 bool rocksdb::DBImpl::MultiCFSnapshot<std::unordered_map<unsigned int, rocksdb::DBImpl::MultiGetColumnFamilyData, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, rocksdb::DBImpl::MultiGetColumnFamilyData> > > >(rocksdb::ReadOptions const&, rocksdb::ReadCallback*, std::function<rocksdb::DBImpl::MultiGetColumnFamilyData* (std::unordered_map<unsigned int, rocksdb::DBImpl::MultiGetColumnFamilyData, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, rocksdb::DBImpl::MultiGetColumnFamilyData> > >::iterator&)>&, std::unordered_map<unsigned int, rocksdb::DBImpl::MultiGetColumnFamilyData, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, rocksdb::DBImpl::MultiGetColumnFamilyData> > >*, unsigned long*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x715087)
```

Reviewed By: ltamasi

Differential Revision: D23725226

Pulled By: jay-zhuang

fbshipit-source-id: a6d662a5ea68111246cd32ec95f3411a25f76bc6
2020-09-25 14:46:28 -07:00
Peter Dillinger
c8a12aa94b EnableFileDeletions only read field while holding mutex (#7435)
Summary:
Possible fix for a TSAN issue reported in EnableFileDeletions.
disable_delete_obsolete_files_ should only be accessed holding the db
mutex, but for logging it was being accessed outside holding the mutex,
now fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7435

Test Plan: existing tests, watch for recurrence

Reviewed By: ltamasi

Differential Revision: D23917578

Pulled By: pdillinger

fbshipit-source-id: 8573025bca3f6fe169b24b87bbfc4ce9667b0482
2020-09-25 13:34:36 -07:00
Cheng Chang
1a24f4d1d6 Track WAL in MANIFEST: add method to check WAL consistency (#7236)
Summary:
Add a method `CheckWals` in `WalSet` to check the logs on disk. See `CheckWals`'s comments.
This method will be used to check consistency of WALs during DB recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7236

Test Plan: a set of tests are added to wal_edit_test.cc.

Reviewed By: riversand963

Differential Revision: D23036505

Pulled By: cheng-chang

fbshipit-source-id: 5b1d6857ac173429b00f950c32c4a5b8d063a732
2020-09-25 13:25:54 -07:00
Jay Zhuang
8c9fff917c MultiGet() with timestamp should respect snapshot (#7404)
Summary:
Similar to PR https://github.com/facebook/rocksdb/issues/7227, add read callback to filter out rows with with
higher sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7404

Reviewed By: riversand963

Differential Revision: D23790762

Pulled By: jay-zhuang

fbshipit-source-id: bce854307612f1a22f985ffc934da627d0a139c2
2020-09-25 09:42:01 -07:00
Akanksha Mahajan
9a63bbd391 Add few unit test cases in ASSERT_STATUS_CHECKED build (#7427)
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427

Test Plan:
1.  ASSERT_STATUS_CHECKED=1 make -j48 check,
                 2. travis build for ASSERT_STATUS_CHECKED,
                 3. Without ASSERT_STATUS_CHECKED:  make check -j64, CircleCI build and travis build

Reviewed By: pdillinger

Differential Revision: D23909983

Pulled By: akankshamahajan15

fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
2020-09-24 21:48:57 -07:00
Akanksha Mahajan
98ac6b646a Add IO Tracer Parser (#7333)
Summary:
Implement a parsing tool io_tracer_parser that takes IO trace file (binary file) with command line argument --io_trace_file and output file with --output_file and dumps the IO trace records in outputfile in human readable form.

Also added unit test cases that generates IO trace records and calls io_tracer_parse to parse those records.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7333

Test Plan:
make check -j64,
 Add unit test cases.

Reviewed By: anand1976

Differential Revision: D23772360

Pulled By: akankshamahajan15

fbshipit-source-id: 9c20519c189362e6663352d08863326f3e496271
2020-09-23 15:50:26 -07:00
Peter Dillinger
31d1cea4f3 Add and fix clang -Wshift-sign-overflow (#7431)
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431

Test Plan: USE_CLANG=1 make check before (fails) and after

Reviewed By: jay-zhuang

Differential Revision: D23876584

Pulled By: pdillinger

fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
2020-09-23 15:26:17 -07:00
rockeet
b005f96937 db_iter.cc: DBIter::Next(): minor improve (#7407)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7407

Reviewed By: ajkr

Differential Revision: D23817122

Pulled By: jay-zhuang

fbshipit-source-id: 62bf43e4d780fad8c682edd750b4800b5b8f4a77
2020-09-23 09:53:24 -07:00
Cheng Chang
00ee89b584 Track WAL in MANIFEST: update WalMetadata for WAL syncing (#7414)
Summary:
There are some tricky behaviors related to WAL sync:

- When creating a WAL, the WAL might not be synced, if the WAL directory is not synced, the WAL file's metadata may not even be synced to disk, so during recovery, when listing the WAL directory, the WAL may not even show up.
- During each DB::Write, the WriteOption can control whether the WAL should be synced, so a WAL previously not synced on creation can be synced during Write.

For each `SyncWAL`, we'll track the synced status and the current WAL size. Previously, we only track the WAL size on closing.
During recovery, we check that the on-disk WAL size is >= the last synced size.

So this PR introduces `synced_size` and `closed` to `WalMetadata` for the above design update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7414

Test Plan:
- updated wal_edit_test
- updated version_edit_test

Reviewed By: riversand963

Differential Revision: D23796127

Pulled By: cheng-chang

fbshipit-source-id: 5498ab80f537c48a10157e71a4745716aef5cf30
2020-09-22 14:35:14 -07:00
Yanqin Jin
cd72f8974b Allow mutex to be released in GetAggregatedIntProperty (#7412)
Summary:
Current implementation holds db mutex while calling
`GetAggregatedIntProperty()`. For property kEstimateTableReadersMem,
this can be expensive, especially if the number of table readers is
high.
We can release and re-acquire db mutex if
property_info.need_out_of_mutex is true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7412

Test Plan:
make check
COMPILE_WITH_ASAN=1 make check
COMPILE_WITH_TSAN=1 make check
Also test internally on a shadow host. Used bpf to verify the
excessively long db mutex holding no longer exists when applications
call GetApproximateMemoryUsageByType().

Reviewed By: jay-zhuang

Differential Revision: D23794824

Pulled By: riversand963

fbshipit-source-id: 6bc02a59fd25613d343a62cf817467c7122c9721
2020-09-22 12:37:16 -07:00
Peter Dillinger
6727259eb4 Possible fix to flaky db_write_test (#7418)
Summary:
Make the test robust to spurious wakeups on condition variable,
and clear sync points to ensure no use-after-free.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7418

Test Plan: repeated runs on updated test, watch CircleCI for recurrence

Reviewed By: jay-zhuang

Differential Revision: D23828823

Pulled By: pdillinger

fbshipit-source-id: af85117d9c02602541a90252840e0e5a6996de5b
2020-09-22 09:57:05 -07:00
Zhichao Cao
485fd9d9db fix the flaky test failure (#7415)
Summary:
Fix the flaky test failure in error_handler_fs_test. Add the sync point, solve the dependency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7415

Test Plan: make asan_check, ~/gtest-parallel/gtest-parallel -r 100 ./error_handler_fs_test

Reviewed By: siying

Differential Revision: D23804330

Pulled By: zhichao-cao

fbshipit-source-id: 5175108651f7652e47e15978f2a9c1669ef59d80
2020-09-19 17:57:54 -07:00
Zhichao Cao
c268628c25 Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310)
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310

Test Plan: adding new unit test, pass make check.

Reviewed By: anand1976

Differential Revision: D23710892

Pulled By: zhichao-cao

fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
2020-09-17 20:25:45 -07:00
Adam Retter
3ac07a12fe RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046)
Summary:
Expose from C++ API to Java API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7046

Reviewed By: riversand963

Differential Revision: D23726297

Pulled By: pdillinger

fbshipit-source-id: fc66bf626ce6fe9797e7d021ac849eacab91bf6d
2020-09-17 15:41:25 -07:00
Peter Dillinger
93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
mrambacher
a08d6f18f0 Add more tests to ASSERT_STATUS_CHECKED (#7367)
Summary:
db_options_test
options_file_test
auto_roll_logger_test
options_util_test
persistent_cache_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7367

Reviewed By: jay-zhuang

Differential Revision: D23712520

Pulled By: zhichao-cao

fbshipit-source-id: 99b331e357f5d6a6aabee89d1bd933002cbb3908
2020-09-16 15:48:07 -07:00
Andrew Kryczka
ec024a86de More robust sync points for intra-L0 compaction tests (#7382)
Summary:
`IntraL0CompactionAfterFlushCheckConsistencyFail` was flaky by sometimes failing due to no intra-L0 compactions happening. I was able to repro it by putting a `sleep(1)` in the compaction thread before it grabs the lock and picks a compaction. This also showed other intra-L0 tests are affected too, although some of them exhibit hanging forever rather than failing.

The problem was that all the flushes/ingestions could finish before any compaction got picked, so it would end up simply picking all the files that the test generates for L0->L1. But, these tests intend only the first few files to be picked for L0->L1, and the subsequent files to be picked for intra-L0. This PR adjusts the sync points of all the intra-L0 tests to enforce this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7382

Test Plan: run all the `db_compaction_test`s with and without the artificial `sleep()`

Reviewed By: jay-zhuang

Differential Revision: D23684985

Pulled By: ajkr

fbshipit-source-id: 6508399030dddec7738e9853a7b3dc53ef77a584
2020-09-15 22:44:16 -07:00
Andrew Kryczka
9d3b2db9b5 Disable fsync in DB tests with timeouts (#7380)
Summary:
Some tests were encountering 600 second timeout in CI, such as `./db_universal_compaction_test --gtest_filter=NumLevels/DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2/5`, `./db_properties_test --gtest_filter=DBPropertiesTest.AggregatedTablePropertiesAtLevel`, and `./db_basic_test --gtest_filter=DBBasicTest.MultiGetBatchedSortedMultiFile`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7380

Test Plan:
- `./db_universal_compaction_test --gtest_filter=NumLevels/DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2/5`: 40 -> 3 seconds
- `./db_properties_test --gtest_filter=DBPropertiesTest.AggregatedTablePropertiesAtLevel`: 106 -> 1 second
- `./db_basic_test --gtest_filter=DBBasicTest.MultiGetBatchedSortedMultiFile`: 27 -> 1 second

Reviewed By: anand1976

Differential Revision: D23674570

Pulled By: ajkr

fbshipit-source-id: 4d4ca6a4e2d2e76fcf8b6f6cce91e0f98ba5050c
2020-09-15 18:55:08 -07:00
Levi Tamasi
bf1aeebb6c Integrate blob file writing with recovery (#7388)
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23709912

Pulled By: ltamasi

fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
2020-09-15 17:14:10 -07:00
mrambacher
67bd5401e9 Changes to EncryptedEnv public API (#7279)
Summary:
Cleaned up the public API to use the EncryptedEnv.  This change will allow providers to be developed and added to the system easier in the future.  It will also allow better integration in the future with the OPTIONS file.

- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header.  Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.

Additionally, there was a code duplication in the NewXXXFile methods.  This common code was moved under a templatized function.

A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv.  The idea was that different providers may use different cipher keys or different versions/algorithms.  The EncryptedEnv should have some means of picking different providers based on information.  The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279

Reviewed By: jay-zhuang

Differential Revision: D23709440

Pulled By: zhichao-cao

fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
2020-09-15 17:14:10 -07:00
Levi Tamasi
b0e7834100 Integrate blob file writing with the flush logic (#7345)
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345

Test Plan: Tested using `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D23506369

Pulled By: ltamasi

fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
2020-09-14 21:11:43 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
anand76
18a3227b12 Add a new IOStatus subcode to indicate that writes are fenced off (#7374)
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374

Test Plan: Add new unit tests in ```error_handler_fs_test```

Reviewed By: riversand963

Differential Revision: D23687777

Pulled By: anand1976

fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
2020-09-14 16:04:47 -07:00
Yanqin Jin
205e577694 Cancel tombstone skipping during bottommost compaction (#7356)
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D23663241

Pulled By: riversand963

fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
2020-09-11 17:45:43 -07:00
Peter Dillinger
be8445eea8 Assert valid linked list for write group (#7375)
Summary:
We've seen some segfaults in db_write_test, with at least one
suggesting corruption of a write group linked list. Adding an assertion
to have this fail in a more specific way if that is the broken
invariant.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7375

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D23638477

Pulled By: pdillinger

fbshipit-source-id: a76fd677cad60a3a516bd363947bfd9ce418edc1
2020-09-11 07:58:31 -07:00
Peter Dillinger
92639b93a6 Fix checkpoint file deletion race with avoid_unnecessary_blocking_io (#7369)
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369

Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit

Reviewed By: ajkr

Differential Revision: D23620258

Pulled By: pdillinger

fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
2020-09-10 22:35:25 -07:00
Stanislav Tkach
5c39d8df69 Add getters to the C API for flush, write, cache and compact options (#7321)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7321

Reviewed By: ajkr

Differential Revision: D23590160

fbshipit-source-id: 35d106e732ac37f674222759cdb1dbb31e005ca7
2020-09-09 11:45:27 -07:00
Levi Tamasi
7b1d6c438a Fix the handling of the case when a blob file with a lower number gets added in VersionBuilder (#7349)
Summary:
When multiple background jobs are generating blob files in parallel, it is actually
possible for a blob file to be added with a file number that is lower than the
highest one in the base version. (This is a harmless race condition.) The patch
fixes the handling of this case and adds a unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7349

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23542453

Pulled By: ltamasi

fbshipit-source-id: 4ff6f3654bc58c391d10b9870e1cc40b5e3fa8e4
2020-09-09 10:25:12 -07:00
Akanksha Mahajan
0de335e076 Use FSRandomRWFilePtr Object to call underlying file system. (#7198)
Summary:
Replace FSRandomRWFile pointer with FSRandomRWFilePtr object in the rocksdb internal code.
This new object wraps FSRandomRWFile pointer.

Objective: If tracing is enabled, FSRandomRWFile object returns FSRandomRWFileTracingWrapper pointer that includes all necessary information in IORecord and calls underlying FileSystem and invokes IOTracer to dump that record in a binary file. If tracing is disabled then, underlying FileSystem pointer is returned directly.
FSRandomRWFilePtr wrapper class is added to bypass the FSRandomRWFileWrapper when
tracing is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7198

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D23421116

Pulled By: akankshamahajan15

fbshipit-source-id: 8a5ba0e7d9c1ba34c3a6f29829b107c5f09ab6a3
2020-09-08 12:21:58 -07:00
Akanksha Mahajan
b175eceb09 Store FSWritableFilePtr object in WritableFileWriter (#7193)
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
    object in WritableFileWriter.
    This new object wraps FSWritableFile pointer.

    Objective: If tracing is enabled, FSWritableFile Ptr returns
    FSWritableFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSWritableFilePtr wrapper class is added to bypass the
    FSWritableFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193

Reviewed By: anand1976

Differential Revision: D23355915

Pulled By: akankshamahajan15

fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
2020-09-08 10:56:08 -07:00
Levi Tamasi
423d051124 Clean up SubcompactionState a bit (#7322)
Summary:
The patch cleans up a few things in `CompactionJob::SubcompactionState`:

* Instead of using both the member initializer list and in-class initializers (and
sometimes both at the same time for the same member), the struct now uniformly
uses the latter to initialize integer members.
* The default parameter value for the constructor parameter `size` is removed.
* The explicitly deleted copy operations are removed, since they are implicitly deleted
anyways because of the `unique_ptr` members.
* The handwritten move operations, which did not move the member `c_iter` and
were not declared `nothrow`, are removed. Note that with the user-declared copy
operations gone (see the previous item), we can rely on the compiler to (correctly)
generate these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7322

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D23382408

Pulled By: ltamasi

fbshipit-source-id: a4ae5af150161c50ff7bdc07fa145482d0150bfe
2020-09-08 09:24:23 -07:00
Yanqin Jin
ab202e8d72 Add a new stats level to exclude tickers (#7329)
Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.

Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```

Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329

Reviewed By: pdillinger

Differential Revision: D23434014

Pulled By: riversand963

fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
2020-09-04 23:25:03 -07:00
Cheng Chang
3f9b75604d Fix wrong level args (#7346)
Summary:
The level args should be output level instead of input levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7346

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D23506373

Pulled By: cheng-chang

fbshipit-source-id: b2f701d44c13581c5c10c4dbebded4fcd354d641
2020-09-03 23:17:37 -07:00
Eduardo Barreto Alexandre
5b1ccdc191 Expose rocksdb_open_column_families_with_ttl C function (#7314)
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314

Reviewed By: cheng-chang

Differential Revision: D23430287

Pulled By: ajkr

fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
2020-09-03 14:39:58 -07:00
Hiep
d0c1a01c1b Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166)
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166

Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.

Reviewed By: cheng-chang

Differential Revision: D23325425

Pulled By: ajkr

fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
2020-09-03 14:39:58 -07:00
Andrew Kryczka
177f8bd063 Bound L0->Lbase fanout in dynamic leveled compaction (#7325)
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325

Test Plan:
contrived benchmark that exhibits the problem:

```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```

Results:

- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction

Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2

Reviewed By: siying

Differential Revision: D23412935

Pulled By: ajkr

fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
2020-09-01 19:34:01 -07:00
Levi Tamasi
792d2f906e Log info about generated blob files in BlobFileBuilder (#7324)
Summary:
The patch adds a log message to `BlobFileBuilder` that is logged upon
generating a blob file, similarly to how we log the generation of table files
during flush and compaction. The log message contains the column family
name, job id, blob file number, and the number and total size of blobs in
the new file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7324

Test Plan: Ran `make check` and checked the actual log messages using a custom `db_bench`.

Reviewed By: riversand963

Differential Revision: D23402229

Pulled By: ltamasi

fbshipit-source-id: ca42beb4db284b783d1eb2651f321032a45d0c5f
2020-08-31 13:24:12 -07:00
Akanksha Mahajan
963314ffd6 Add unit test for max_write_buffer_size_to_maintain (#7311)
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311

Test Plan: Compared the results with before bug fix.

Reviewed By: ltamasi

Differential Revision: D23321702

Pulled By: akankshamahajan15

fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
2020-08-28 17:38:05 -07:00
Levi Tamasi
5043960623 Add a blob file builder class that can be used in background jobs (#7306)
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23298817

Pulled By: ltamasi

fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
2020-08-27 11:55:54 -07:00
Akanksha Mahajan
8e0df9050c Store FSRandomAccessPtr object in RandomAccessFileReader (#7192)
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
    object in RandomAccessFileReader.
    This new object wraps FSRandomAccessFile pointer.

    Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
    FSRandomAccessFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192

Reviewed By: anand1976

Differential Revision: D23356867

Pulled By: akankshamahajan15

fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
2020-08-27 11:21:52 -07:00
Peter Dillinger
9aad24da55 Real fix for race in backup custom checksum checking (#7309)
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.

This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.

Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309

Test Plan: updated unit test, before and after

Reviewed By: ajkr

Differential Revision: D23311994

Pulled By: pdillinger

fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
2020-08-26 10:39:20 -07:00
sdong
722814e357 Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297)
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297

Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.

Reviewed By: anand1976

Differential Revision: D23257950

fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
2020-08-25 19:01:05 -07:00