Commit Graph

1392 Commits

Author SHA1 Message Date
Andrew Kryczka
538d2365e9 Fix race condition in BackupEngineTest.ChangeManifestDuringBackupCreation (#9327)
Summary:
The failure looked like this:

```
utilities/backupable/backupable_db_test.cc:3161: Failure
Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound()
  Actual: false
Expected: true
```

The failure could be coerced consistently with the following patch:

```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 80410f671..637636791 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
     if (job_context.HaveSomethingToClean() ||
         job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
       mutex_.Unlock();
+      bg_cv_.SignalAll();
+      sleep(1);
       TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
       // Have to flush the info logs before bg_flush_scheduled_--
       // because if bg_flush_scheduled_ becomes 0 and the lock is
```

The cause was a familiar problem, which is manual flush/compaction may
return before files they obsoleted are removed. The solution is just to
wait for "scheduled" work to complete, which includes all phases
including cleanup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327

Test Plan:
after this PR, even the above patch to coerce the bug cannot
cause the test to fail.

Reviewed By: riversand963

Differential Revision: D33252208

Pulled By: ajkr

fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9
2021-12-22 21:59:53 -08:00
Sergei Petrunia
1b076e82db Expose locktree's wait count in RangeLockManagerHandle::Counters (#9289)
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.

Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289

Reviewed By: jay-zhuang

Differential Revision: D33079182

Pulled By: riversand963

fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
2021-12-22 21:14:48 -08:00
mrambacher
423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Peter Dillinger
0050a73a4f New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:15:13 -08:00
Yanqin Jin
08721293ea Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236)
Summary:
`db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determins
data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619).

In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in
the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a
file from a given level twice.

Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause.

WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each
`WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared.

If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS`
buffers data written for each file until next file sync. After an injected error, user of `FaultInjectionFS` can
choose to drop some or none of previously buffered data. If `db_stress` does not drop any unsynced data, then
such data will still exist in the `FaultInjectionTestFS`'s buffer.

Existing implementation of `WritableileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead
to the data being buffered two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`.
We also know that the `WritableFileWriter` of MANIFEST file will close upon an error.  During `Close()`, it will flush the
content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the
data appended to the file.

To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode.

This PR includes a unit test to reproduce a case when write error injection
to `WritableFile` can cause duplicate trailing entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D33033984

Pulled By: riversand963

fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31
2021-12-13 09:00:36 -08:00
Hui Xiao
cd85439632 Make TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAcces less flaky (#9281)
Summary:
Context:
[Rapid thread creation and deletion](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L439-L444) in  `SnapshotConcurrentAccessTest.SnapshotConcurrentAcces` inside a [potentially big loop](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L1238-L1248) can lead to heavy-loading the system with many threads due to delay in actually cleaning up thread's resource in the kernel sometime. We ran into some [flaky failure](https://app.circleci.com/pipelines/github/facebook/rocksdb/10383/workflows/136f1005-80a9-4515-aee9-fe36ac6462a1/jobs/253289) in CI and reproduced it by below:

- Command
```
Added `ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();` like https://github.com/facebook/rocksdb/pull/9276
DEBUG_LEVEL=2 make -j56 write_prepared_transaction_test
GTEST_CATCH_EXCEPTIONS=0 ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```
- Stack, where `write_prepared_transaction_test.cc:442` in `https://github.com/facebook/rocksdb/issues/9` points to thread creation
```
[ RUN      ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
....terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Received signal 6 (Aborted)
#0   /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38) [0x7fc114f39438]
...
https://github.com/facebook/rocksdb/issues/7   /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8e73) [0x7fc1158a5e73] ??	??:0
https://github.com/facebook/rocksdb/issues/8   ./write_prepared_transaction_test() [0x4ca86c] std:🧵:thread<rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> 	 const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()https://github.com/facebook/rocksdb/issues/1}>(rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, s	d::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{l	mbda()https://github.com/facebook/rocksdb/issues/1}&&)	/usr/include/c++/5/thread:137 (discriminator 4)
https://github.com/facebook/rocksdb/issues/9   ./write_prepared_transaction_test() [0x4bb80c] rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::W	itePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)	/home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:442
https://github.com/facebook/rocksdb/issues/10  ./write_prepared_transaction_test() [0x4407b6] rocksdb::SnapshotConcurrentAccessTest_SnapshotConcurrentAccess_Test::TestBody()	/home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:1244
...
[109/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 returned/aborted with exit code -6 (34462 ms)
```

- Move thread 2's work into current thread to avoid half of the thread creation cuz there is no difference in doing so. We expect this can make the thread-creation error less often, even though we can't gurantee it from happening again. Considering this is a trivial change with positive impact, it's still worth landing and monitor if it's enough to solve the problem in reality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9281

Test Plan:
Before the change, repeating the test 200 times with 200 workers failed
`~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN      ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
..unknown file: Failure
C++ exception with description "Resource temporarily unavailable" thrown in the test body.
[  FAILED  ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (11882 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (11882 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11882 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20)
```

After the change: repeating the test 200 times with 200 workers didn't fail, even with repeating the "repeating" for 10 times like below
`for i in {1..10}; do ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1; done`

```
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```

It does failed when repeating the test 400 times with 400 workers
`~/project$ ~/gtest-parallel/gtest-parallel -r 400 -w 400 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`

```
[1/400] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 (2928 ms)
Note: Google Test filter = TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN      ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (2597 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (2597 ms total)
```

Reviewed By: ajkr

Differential Revision: D33026776

Pulled By: hx235

fbshipit-source-id: 509f57126392821e835e48396e5bf224f4f5dcac
2021-12-10 12:52:33 -08:00
Yanqin Jin
bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Peter Dillinger
aec95b8c09 Debug "Resource temporarily unavailable" exception in CircleCI (#9276)
Summary:
This changes write_prepared_transaction_test under CircleCI to
print a stack trace on unhandled exception, so that we can debug rare
exceptions seen in CircleCI:

    [ RUN      ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/24
    .......unknown file: Failure
    C++ exception with description "Resource temporarily unavailable" thrown in the test body.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9276

Test Plan:
manual run test with seeded 'throw', with and without
CIRCLECI=true environment variable

Reviewed By: ajkr, hx235

Differential Revision: D32996993

Pulled By: pdillinger

fbshipit-source-id: e790408ce204b676d3d84a290e41be511b203bfa
2021-12-09 12:58:46 -08:00
Peter Dillinger
80ac7412b5 Polish/deflake BackupEngineTest.FileCollision (#9257)
Summary:
Use smaller and more predictable behaviors

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9257

Test Plan:
gtest-parallel --repeat=N ./backupable_db_test --gtest_filter=BackupEngineTest.FileCollision

before (N=50) we see inconsistent sets of SST files

    $ find /dev/shm/rocksdb_blah/ | grep -o '/00.*sst' | grep -o '^[^_]*' | sort | uniq -c
     49 /000009
      3 /000010
      1 /000010.sst
     49 /000012
      3 /000013
      1 /000013.sst
     49 /000015
      2 /000016
      1 /000016.sst
     22 /000018
      2 /000019
      1 /000019.sst
     29 /000020
     11 /000021
      2 /000021.sst
     46 /000022
      2 /000022.sst
      4 /000023
      1 /000023.sst
     27 /000025

And after (N=5000) we see

    $ find /dev/shm/rocksdb_blah/ | grep -o '/00.*sst' | grep -o '^[^_]*' | sort | uniq -c
      10000 /000009
      10000 /000012
       5000 /000015

Reviewed By: ajkr

Differential Revision: D32888393

Pulled By: pdillinger

fbshipit-source-id: 5bfd075b3184bb66c5613758a53f431c406e9808
2021-12-08 21:57:46 -08:00
lgqss
77c7085594 MemTableList::TrimHistory now use allocated bytes (#9020)
Summary:
Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0.
The bug was introduced in 6.5.0 and  https://github.com/facebook/rocksdb/issues/5022.
Fix https://github.com/facebook/rocksdb/issues/8371

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020

Reviewed By: pdillinger

Differential Revision: D32767084

Pulled By: ajkr

fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f
2021-12-02 11:45:39 -08:00
Peter Dillinger
735fe61e8f Fix flaky CassandraFunctionalTest...ExpiredColumnsToTombstone (#9226)
Summary:
You could easily reproduce the failure by injecting sleep(11)
before `store.Flush()`. Fixed by setting TTL time to approximately test
timeout time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9226

Test Plan: manual

Reviewed By: akankshamahajan15

Differential Revision: D32698105

Pulled By: pdillinger

fbshipit-source-id: 40529af9d9f2389585988b7c81dffb120e2795a2
2021-11-29 09:53:07 -08:00
Adam Retter
d94932323a Check that newIteratorWithBase regardless of WBWI Overwrite Mode (#8134)
Summary:
The behaviour of WBWI has changed when calling newIteratorWithBase when overwrite is set to true or false. This PR simply adds tests to assert the new correct behaviour.

Closes https://github.com/facebook/rocksdb/issues/7370
Closes https://github.com/facebook/rocksdb/pull/8134

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9107

Reviewed By: pdillinger

Differential Revision: D32099475

Pulled By: mrambacher

fbshipit-source-id: 245f483f73db866cc8a51219a2bff2e09e59faa0
2021-11-18 11:53:09 -08:00
Hui Xiao
cff7819dff Fix BackupEngine's internal callers of GenericRateLimiter::Request() not honoring bytes <= GetSingleBurstBytes() (#9063)
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.

**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.

There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.

**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
   -- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063

Test Plan:
- Added a new test that failed the assertion before this change and now passes
  - It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests

Reviewed By: ajkr

Differential Revision: D31824535

Pulled By: hx235

fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
2021-11-16 09:52:16 -08:00
Yanqin Jin
2035798834 Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162

Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.

Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.

Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.

Reviewed By: ltamasi

Differential Revision: D31567960

fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
2021-11-15 12:52:18 -08:00
jsteemann
a7478070f3 Fix small issues (#5896)
Summary:
The individual commits in this PR should be self-explanatory.
All small and _very_ low-priority changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896

Reviewed By: riversand963

Differential Revision: D18065108

Pulled By: mrambacher

fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280
2021-11-08 12:32:38 -08:00
anand76
78556c14dd Secondary cache error injection (#9002)
Summary:
Implement secondary cache error injection in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9002

Reviewed By: zhichao-cao

Differential Revision: D31874896

Pulled By: anand1976

fbshipit-source-id: 8cf04c061a4a44efa0fe88423d05cade67b85f73
2021-11-08 10:27:27 -08:00
Yanqin Jin
5237b39d2e Fix assertion error during compaction with write-prepared txn enabled (#9105)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105

The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.

```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()

Rollback txn
Flush()
```

There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.

```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```

The operation sequence 2 can result in an L0 with only the SD.

Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).

```
else if (clear_and_output_next_key_) {
  assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```

To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.

Reviewed By: ltamasi

Differential Revision: D32056848

fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
2021-11-05 15:29:18 -07:00
Yanqin Jin
9b53f14a35 Fixed a bug in CompactionIterator when write-preared transaction is used (#9060)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060

RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.

The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.

Reviewed By: ltamasi

Differential Revision: D31813016

fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
2021-11-03 15:55:00 -07:00
Jay Zhuang
29102641dd Skip directory fsync for filesystem btrfs (#8903)
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
   to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
   `IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
   forced, the same as above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903

Test Plan: run tests on btrfs and non btrfs

Reviewed By: ajkr

Differential Revision: D30885059

Pulled By: jay-zhuang

fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
2021-11-03 12:21:27 -07:00
mrambacher
f72c834eab Make FileSystem a Customizable Class (#8649)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649

Reviewed By: zhichao-cao

Differential Revision: D32036059

Pulled By: mrambacher

fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295
2021-11-02 09:07:11 -07:00
Yanqin Jin
fdf2a0d7eb Fix a compaction bug for write-prepared txn (#9061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061

In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:

```
earliest_snap < earliest_write_conflict_snap
```

If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.

Reviewed By: ltamasi

Differential Revision: D31813017

fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
2021-10-29 15:23:17 -07:00
Peter Dillinger
92e2399669 Fix EnvLibrados and add to CI (#9088)
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.

Fixes https://github.com/facebook/rocksdb/issues/9078

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D32009467

Pulled By: pdillinger

fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
2021-10-29 08:19:03 -07:00
Jonathan Albrecht
e970248602 Add support for building on s390x platform (#8962)
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.

There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?

1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image

Not sure if there is more required for travis. Happy to help in any way I can.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962

Reviewed By: mrambacher

Differential Revision: D31802198

Pulled By: pdillinger

fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
2021-10-22 10:13:15 -07:00
Peter Dillinger
3ffb3baa0b Add (Live)FileStorageInfo API (#8968)
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.

This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.

Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.

Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)

Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968

Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).

Reviewed By: siying

Differential Revision: D31242045

Pulled By: pdillinger

fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
2021-10-16 10:04:32 -07:00
Yanqin Jin
e1139167ae Inline an empty destructor (#9004)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9004

Inline an empty destructor

Reviewed By: ltamasi

Differential Revision: D31525561

fbshipit-source-id: 3b9e37f06b0c70529a5d2d660de21ea335c73611
2021-10-11 18:14:10 -07:00
Yanqin Jin
1a79839c59 Some code cleanup (#9003)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9003

cleanup some code before real work.

Reviewed By: ltamasi

Differential Revision: D31525563

fbshipit-source-id: 44558b3594f2200adc7d8621b08b06c77e358a27
2021-10-11 18:14:10 -07:00
Andrew Kryczka
a282eff3d1 Protect existing files in FaultInjectionTest{Env,FS}::ReopenWritableFile() (#8995)
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.

The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995

Test Plan:
- Verified it fixes the following failure:

```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```

- `make check -j48`

Reviewed By: ltamasi

Differential Revision: D31495388

Pulled By: ajkr

fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
2021-10-11 16:23:18 -07:00
Andrew Kryczka
ee239df351 Initialize cache dumper DumpUnit in constructor (#9014)
Summary:
Should fix clang-analyze:

```
utilities/cache_dump_load_impl.cc:296:38: warning: The left operand of '!=' is a garbage value
  while (io_s.ok() && dump_unit.type != CacheDumpUnitType::kFooter) {
                      ~~~~~~~~~~~~~~ ^
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9014

Reviewed By: zhichao-cao

Differential Revision: D31546912

Pulled By: ajkr

fbshipit-source-id: a2e0dc7874e8c1c6abf190862b5d49e6a6ad6d01
2021-10-11 13:05:35 -07:00
Zhichao Cao
699f45049d Introduce a mechanism to dump out blocks from block cache and re-insert to secondary cache (#8912)
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.

Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912

Test Plan: add new tests to lru_cache_test and pass make check.

Reviewed By: pdillinger

Differential Revision: D31452871

Pulled By: zhichao-cao

fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
2021-10-07 11:42:31 -07:00
mrambacher
13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
mrambacher
7fd68b7c39 Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638

Reviewed By: zhichao-cao

Differential Revision: D31024729

Pulled By: mrambacher

fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0
2021-09-28 05:32:02 -07:00
sdong
7c6a7e8fa8 FaultInjectionTestFS::InjectThreadSpecificReadError() should not corrupt mmaped bytes (#8952)
Summary:
Right now FaultInjectionTestFS::InjectThreadSpecificReadError() might try to corrupt return bytes, but these bytes might be from mmapped files, which would cause segfault. Instead FaultInjectionTestFS::InjectThreadSpecificReadError() should never corrupt data unless it is in caller's buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8952

Test Plan: See db_stress still runs and make sure in a test run failurs are still injected in non-mmap cases.

Reviewed By: ajkr, ltamasi

Differential Revision: D31147318

fbshipit-source-id: 9484a64ff2aaa36685557203f449286e694e65f9
2021-09-23 12:00:47 -07:00
Levi Tamasi
be206db351 Deflake MySQLStyleTransactionTest.TransactionStressTest in "status checked" mode (#8947)
Summary:
There is a corner case when using WriteUnprepared transactions when
`WriteUnpreparedTxn::Get` returns `Status::TryAgain` instead of
propagating the result of `GetFromBatchAndDB`. The patch adds
`PermitUncheckedError` to make the `ASSERT_STATUS_CHECKED` build pass in
this case as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8947

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D31125422

Pulled By: ltamasi

fbshipit-source-id: 42de51dcfa9384e032244c2b4d3f40e9a4111194
2021-09-22 16:40:25 -07:00
sdong
9320067703 Improve fault injection to MultiRead (#8937)
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937

Reviewed By: anand1976

Differential Revision: D31085930

fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
2021-09-21 14:48:15 -07:00
Peter Dillinger
5268cdc997 Finish BackupEngine migration to IOStatus (#8940)
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.

This change is for inclusion in version 6.25.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D31085029

Pulled By: pdillinger

fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b
2021-09-21 11:13:17 -07:00
sdong
4f1dd05cec Implement TestFSRandomAccessFile::MultiRead() (#8925)
Summary:
Right now, the failure injection test for MultiGet() is not sufficient. Improve it with TestFSRandomAccessFile::MultiRead() injecting failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8925

Test Plan: Run crash test locally for a while.

Reviewed By: anand1976

Differential Revision: D31000529

fbshipit-source-id: 439c7e02cf7440ac5af82deb609e202abdca3e1f
2021-09-16 16:01:34 -07:00
Zhichao Cao
82e7631de6 Replace Status with IOStatus in the backupable_db (#8820)
Summary:
In order to populate the IOStatus up to the higher level, replace some of the Status to IOStatus.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8820

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D30967215

Pulled By: zhichao-cao

fbshipit-source-id: ccf9d5cfbd9d3de047c464aaa85f9fa43b474903
2021-09-15 15:09:48 -07:00
mrambacher
dafa584fd1 Change the File System File Wrappers to std::unique_ptr (#8618)
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership.  Previously, many classes implemented their own ownership solutions.  Fixes https://github.com/facebook/rocksdb/issues/8606

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618

Reviewed By: pdillinger

Differential Revision: D30136064

Pulled By: mrambacher

fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
2021-09-13 08:46:19 -07:00
Peter Dillinger
bda8d93ba9 Fix and detect headers with missing dependencies (#8893)
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.

rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)

Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.

This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893

Test Plan: existing tests and resolving issues detected by new check

Reviewed By: mrambacher

Differential Revision: D30823300

Pulled By: pdillinger

fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
2021-09-10 10:00:26 -07:00
mrambacher
0fb938c448 Add support to the ObjectRegistry for ManagedObjects (#8658)
Summary:
ManagedObjects are  shared pointer objects where RocksDB wants to share a single object between multiple configurations.  For example, the Cache may be shared between multiple column families/tables or the Statistics may be shared between multiple databases.

ManagedObjects are stored in the ObjectRegistry by Type (e.g. Cache) and ID.  For a given type/ID name, a single object is stored.

APIs were added to get/set/create these objects.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8658

Reviewed By: pdillinger

Differential Revision: D30806273

Pulled By: mrambacher

fbshipit-source-id: 832ac4423b210c4c4b4a456b35897334775d3160
2021-09-10 05:21:04 -07:00
hx235
45175ca2e1 Charge read to rate limiter in BackupEngine (#8722)
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722

Test Plan:
- Passed existing tests
- Passed added unit tests

Reviewed By: pdillinger

Differential Revision: D30610464

Pulled By: hx235

fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
2021-09-08 16:24:40 -07:00
Andrew Kryczka
dd092c2d11 prevent stranded LATEST_BACKUP in BackupEngineTest.NoDeleteWithReadOnly (#8887)
Summary:
A "LATEST_BACKUP" file was left in the backup directory by
"BackupEngineTest.NoDeleteWithReadOnly" test, affecting future test
runs. In particular, it caused "BackupEngineTest.IOStats" to fail since
it relies on backup directory containing only data written by its
`BackupEngine`.

The fix is to promote "LATEST_BACKUP" to an explicitly managed file so
it is deleted in `BackupEngineTest` constructor if it exists.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8887

Test Plan:
below command used to fail. Now it passes:

```
$ TEST_TMPDIR=/dev/shm ./backupable_db_test --gtest_filter='BackupEngineTest.NoDeleteWithReadOnly:BackupEngineTest.IOStats'
```

Reviewed By: pdillinger

Differential Revision: D30812336

Pulled By: ajkr

fbshipit-source-id: 32dfbe1368ebdab872e610764bfea5daf9a2af09
2021-09-08 13:39:01 -07:00
Andrew Kryczka
9308ff366c Bytes read/written stats for CreateNewBackup*() (#8819)
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819

Reviewed By: pdillinger

Differential Revision: D30779238

Pulled By: ajkr

fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
2021-09-07 18:25:16 -07:00
Peter Dillinger
0ef88538c6 Improve support for using regexes (#8740)
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.

Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740

Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:

utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)

db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA

Reviewed By: mrambacher

Differential Revision: D30706246

Pulled By: pdillinger

fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
2021-09-07 13:05:23 -07:00
Peter Dillinger
4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Peter Dillinger
32752551b9 Fix a buffer size race condition in BackupEngine (#8732)
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732

Test Plan: updated unit test fails with TSAN before change, passes after

Reviewed By: ajkr

Differential Revision: D30683879

Pulled By: pdillinger

fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
2021-09-01 14:28:58 -07:00
Akanksha Mahajan
f9ffeaed3f Update branch name to main in env_librados.md (#8738)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8738

Reviewed By: ltamasi

Differential Revision: D30705691

Pulled By: akankshamahajan15

fbshipit-source-id: 44ac8e1c906b1d1d31e9017a700aab5eefe94253
2021-09-01 14:28:58 -07:00
Peter Dillinger
13ded69484 Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
2021-08-30 15:20:41 -07:00
Merlin Mao
6c2bd28a61 Update comments, fix typos. (#8721)
Summary:
- Removed the default empty constructors of `TraceWriter` and `TraceReader`.
- Removed unused `ReadFooter()` from `ReplayerImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8721

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30609743

Pulled By: autopear

fbshipit-source-id: 7e2626b015bd57ebb408a2836b4b4217cea10002
2021-08-27 13:16:32 -07:00
Merlin Mao
f6437ea4d7 Refactor TraceAnalyzer to use TraceRecord::Handler to avoid casting. (#8678)
Summary:
`TraceAnalyzer` privately inherits `TraceRecord::Handler` and `WriteBatch::Handler`.

`trace_analyzer_test` can pass with this change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8678

Reviewed By: zhichao-cao

Differential Revision: D30459814

Pulled By: autopear

fbshipit-source-id: a27f59ac4600f7c3682830c9b1d9dc79e53425be
2021-08-23 17:18:27 -07:00
Merlin Mao
d10801e983 Allow Replayer to report the results of TraceRecords. (#8657)
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
2021-08-18 17:06:14 -07:00
Yanqin Jin
2b367fa8cc Fix bug caused by releasing snapshot(s) during compaction (#8608)
Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
2021-08-17 22:14:20 -07:00
Merlin Mao
74a652a45f Code cleanup for trace replayer (#8652)
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
2021-08-12 09:22:43 -07:00
Merlin Mao
f58d276764 Make TraceRecord and Replayer public (#8611)
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
2021-08-11 19:32:46 -07:00
Peter Dillinger
a7fd1d0881 Make backup restore atomic, with sync option (#8568)
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
2021-08-06 09:50:21 -07:00
mrambacher
d057e8326d Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
2021-08-06 08:27:25 -07:00
anand76
c268859aaa Remove corruption error injection in FaultInjectionTestFS (#8616)
Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.

Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616

Reviewed By: zhichao-cao

Differential Revision: D30074422

Pulled By: anand1976

fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df
2021-08-04 15:48:54 -07:00
Merlin Mao
4811115b3e Revert checkpoint fix (#8607)
Summary:
PR https://github.com/facebook/rocksdb/pull/8572 looses custom types in the options file. Need more API changes to fix this issue. Revert this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8607

Reviewed By: ajkr

Differential Revision: D30058289

Pulled By: autopear

fbshipit-source-id: 78f5a154c0bf193e8441bae4a36fa79b95277fd4
2021-08-02 18:29:35 -07:00
Mikhail Golubev
8f52972cf9 Allow to use a string as a delimiter in StringAppendOperator (#8536)
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
2021-08-02 16:50:41 -07:00
mrambacher
ab7f7c9e49 Allow WAL dir to change with db dir (#8582)
Summary:
Prior to this change, the "wal_dir"  DBOption would always be set (defaults to dbname) when the DBOptions were sanitized.  Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.

After this change, the "wal_dir" option is only set under specific circumstances.  Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname.  Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).

Tests were added to the core and ldb to test that a database could be created and renamed without issue.  Additional tests for various permutations of wal_dir were also added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582

Reviewed By: pdillinger, autopear

Differential Revision: D29881122

Pulled By: mrambacher

fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
2021-07-30 12:16:44 -07:00
Yanqin Jin
066b51126d Several simple local code clean-ups (#8565)
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565

Test Plan: make check

Reviewed By: lth

Differential Revision: D29963984

Pulled By: riversand963

fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
2021-07-30 12:07:49 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
Merlin Mao
55f7ded80d Checkpoint dir options fix (#8572)
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.

2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.

`wal_dir`:  If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.

`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.

A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572

Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```

Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN      ] CheckpointTest.CheckpointWithOptionsDirsTest
[       OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[  PASSED  ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.

Reviewed By: zhichao-cao

Differential Revision: D29832761

Pulled By: autopear

fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
2021-07-23 11:13:01 -07:00
Drewryz
3b27725245 Fix a minor issue with initializing the test path (#8555)
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555

Reviewed By: ajkr

Differential Revision: D29758399

Pulled By: jay-zhuang

fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
2021-07-23 08:38:45 -07:00
Jay Zhuang
c4a503f3df Fix an race condition during multiple DB opening (#8574)
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574

Test Plan: pass the failed test

Reviewed By: ajkr

Differential Revision: D29855096

Pulled By: jay-zhuang

fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
2021-07-22 13:43:06 -07:00
Zhichao Cao
87e82a41a9 Fix incorrect Status::NoSpace() status check (#8504)
Summary:
If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D29601764

Pulled By: zhichao-cao

fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086
2021-07-20 18:09:51 -07:00
sdong
39a07c9651 DB Stress Reopen write failure to skip WAL (#8548)
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548

Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.

Reviewed By: jay-zhuang

Differential Revision: D29745095

fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
2021-07-16 16:09:33 -07:00
Peter Dillinger
df5dc73bec Don't hold DB mutex for block cache entry stat scans (#8538)
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.

This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.

Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 14:13:08 -07:00
longlijian
4e4ec16957 Replace the namespace "rocksdb" to "ROCKSDB_NAMESPACE" (#8531)
Summary:
For more detail can reference the https://github.com/facebook/rocksdb/issues/6433
(https://github.com/facebook/rocksdb/pull/6433)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8531

Reviewed By: siying

Differential Revision: D29717057

Pulled By: ajkr

fbshipit-source-id: 3ccad9501e5612590e54a7cf8c447118f323c7f4
2021-07-15 17:23:39 -07:00
Myth
bbdc4f2e9a Fix a minor issue in checkpoint test case (#8483)
Summary:
A very simple change :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8483

Reviewed By: ajkr

Differential Revision: D29558904

Pulled By: mrambacher

fbshipit-source-id: bbe68c20c861103726cb6231ca3fb8fbe1e5a546
2021-07-12 09:09:09 -07:00
sdong
b1a53db327 FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() to recover… (#8501)
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489

Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501

Test Plan: Run stress test for a while and see it doesn't break.

Reviewed By: anand1976

Differential Revision: D29594384

fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
2021-07-07 16:23:23 -07:00
Andrew Kryczka
ed8eb436db Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475)
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475

Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly

Reviewed By: jay-zhuang

Differential Revision: D29504843

Pulled By: ajkr

fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
2021-07-07 11:14:05 -07:00
anand76
df4197ca6e Bypass buffer in TestFSWritableFile if direct IO is enabled (#8490)
Summary:
```TestFSWritableFile``` buffers data in ```Append``` in order to simulate unsynced data loss on crash. This is only required for buffered IO and should be disabled for direct IO. Otherwise, it causes crash tests to assert on the buffer address alignment - ```db_stress: env/io_posix.cc:1194: virtual rocksdb::IOStatus rocksdb::PosixWritableFile::Append(const rocksdb::Slice&, const rocksdb::IOOptions&, rocksdb::IODebugContext*): Assertion `IsSectorAligned(data.data(), GetRequiredBufferAlignment())' failed.```.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8490

Reviewed By: zhichao-cao

Differential Revision: D29565080

Pulled By: anand1976

fbshipit-source-id: 682831fd66ed3b9597caa74fc453e22dfaf9b973
2021-07-06 16:46:16 -07:00
sdong
f33611d5e9 Stress test to inject read failures in DB reopen (#8476)
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476

Test Plan: Some manual tests and make sure failures are triggered.

Reviewed By: anand1976

Differential Revision: D29507283

fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
2021-07-06 11:05:27 -07:00
sdong
ba224b75c7 Stress Test to inject write failures in reopen (#8474)
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474

Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.

Reviewed By: zhichao-cao

Differential Revision: D29503116

fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
2021-06-30 16:46:41 -07:00
Zhichao Cao
a904c62d28 Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.

In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412

Test Plan: make check, add new testing cases.

Reviewed By: anand1976

Differential Revision: D29151545

Pulled By: zhichao-cao

fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
2021-06-25 00:47:17 -07:00
Levi Tamasi
3ab0eae860 Deflake BlobDBTest.SnapshotAndGarbageCollection (#8444)
Summary:
This test case has been failing occasionally due to automatic
compactions kicking in, resulting in GC generating additional
blob files that the test did not expect. Disabling automatic
compactions to get rid of this flakiness.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8444

Test Plan: `gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="BlobDBTest.SnapshotAndGarbageCollection"`

Reviewed By: jay-zhuang

Differential Revision: D29316987

Pulled By: ltamasi

fbshipit-source-id: 9815d189ed7d63890622768675a01990e3680221
2021-06-22 17:34:03 -07:00
Jay Zhuang
f89423a57a Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)" (#8438)
Summary:
This reverts commit 25be1ed66a.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438

Test Plan: Run the impacted mysql test 40 times

Reviewed By: ajkr

Differential Revision: D29286247

Pulled By: jay-zhuang

fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
2021-06-22 11:10:03 -07:00
聂佩轩
d53f7ff69a Add DeteleRange support for DBWithTTL (#8384)
Summary:
This commit is for enabling `DBWithTTL` to use `DeteleRange` which it cannot before.
As (int32_t)Timestamp is suffixed to values in `DBWithTTL`, there is no reason that it
cannot use the common used api. I added `DeleteRangeCF` in `DBWithTTLImpl::Write`
so that we can use `DeteleRange` normally. When we run code like
`dbWithTtl->DeleteRange(start, end)`, it executes`WriteBatchInternal::DeleteRange`
internally. Intended to fix https://github.com/facebook/rocksdb/issues/7218

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8384

Test Plan: added corresponded testing logic to existing unit test

Reviewed By: jay-zhuang

Differential Revision: D29176734

fbshipit-source-id: 6874ed979fc08e1d138149d03653e43a75f0e0e6
2021-06-17 16:00:50 -07:00
Andrew Kryczka
25be1ed66a Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)
Summary:
This reverts commit 9167ece586.

It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410

Test Plan: `make check -j48`

Reviewed By: jay-zhuang

Differential Revision: D29150300

Pulled By: ajkr

fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
2021-06-15 18:15:15 -07:00
Zhichao Cao
f44e69c64a Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360

Test Plan: add the test to lru_cache_test

Reviewed By: pdillinger

Differential Revision: D29006215

Pulled By: zhichao-cao

fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
2021-06-10 11:02:43 -07:00
Andrew Kryczka
9167ece586 Snapshot release triggered compaction without multiple tombstones (#8357)
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.

This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.

The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:

- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357

Test Plan: - make check

Reviewed By: jay-zhuang

Differential Revision: D28855340

Pulled By: ajkr

fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
2021-06-04 00:21:40 -07:00
Peter Dillinger
311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
mrambacher
6b0a22a4b0 Fix MultiGet with PinnableSlices and Merge for WBWI (#8299)
Summary:
The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned.  This could happen if the value was retrieved from the DB (not memtable) or potentially if the values were reused (and a previous iteration returned a slice that was pinned).

This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299

Reviewed By: jay-zhuang

Differential Revision: D28455426

Pulled By: mrambacher

fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
2021-05-18 14:35:47 -07:00
anand76
feb06e83b2 Initial support for secondary cache in LRUCache (#8271)
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.

Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.

cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```

db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```

```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271

Reviewed By: zhichao-cao

Differential Revision: D28357511

Pulled By: anand1976

fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
2021-05-13 22:58:40 -07:00
Andrew Kryczka
d76c46e6a0 Deflake TransactionStressTest.ExpiredTransactionDataRace1 (#8258)
Summary:
We saw the `Commit()` fail with "Operation expired" so apparently the
expiration time is too short. Increased the magnitude of the times in
this test to make flakiness less likely.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8258

Reviewed By: jay-zhuang

Differential Revision: D28177033

Pulled By: ajkr

fbshipit-source-id: 0357acee6cc14c104b6ccd39231a683a606ab130
2021-05-12 15:49:05 -07:00
Peter Dillinger
78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
mrambacher
9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
mrambacher
ff463742b5 Add Merge Operator support to WriteBatchWithIndex (#8135)
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.

Examples of issues that exist which are solved by this PR:

## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.

## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135

Reviewed By: pdillinger

Differential Revision: D27657938

Pulled By: mrambacher

fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
2021-05-10 12:50:25 -07:00
Peter Dillinger
c26b75baa5 Deprecate obsolete "backupable db" from public APIs (#8274)
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.

This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.

Later changes will fix the internal implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274

Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.

Reviewed By: mrambacher

Differential Revision: D28259471

Pulled By: pdillinger

fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
2021-05-07 13:53:15 -07:00
mrambacher
8948dc8524 Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).

Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.

Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262

Reviewed By: pdillinger

Differential Revision: D28226540

Pulled By: mrambacher

fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
2021-05-05 14:00:17 -07:00
Andrew Kryczka
0f42e50fec Fix GetLiveFiles() returning OPTIONS-000000 (#8268)
Summary:
See release note in HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268

Test Plan: unit test repro

Reviewed By: siying

Differential Revision: D28227901

Pulled By: ajkr

fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
2021-05-05 12:54:46 -07:00
Peter Dillinger
85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
sdong
cde69a7cfd db_stress to add --open_metadata_write_fault_one_in (#8235)
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235

Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.

Reviewed By: anand1976

Differential Revision: D28010944

fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
2021-04-28 10:58:05 -07:00
Adam Retter
2760c2aef8 WBWI Internal Move implementation from .h into .cpp (#8229)
Summary:
Moves some of the structural refactoring from https://github.com/facebook/rocksdb/pull/8135 into this PR.
This just cleans up the code by moving implementation out of the .h file and into the .cc file.

Should be considered for merge before both https://github.com/facebook/rocksdb/pull/7214 and https://github.com/facebook/rocksdb/pull/8135

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8229

Reviewed By: jay-zhuang

Differential Revision: D27999669

Pulled By: mrambacher

fbshipit-source-id: 6eccecbf1f11bb9f5a173e86d1e7bc448bc96071
2021-04-26 09:48:22 -07:00
mrambacher
01e460d538 Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176)
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code.  With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.

readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).

There are still fields in the ImmutableCFOptions that are not CF options but DB options.  Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions).  But that will be part of a future PR to minimize changes and disruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176

Reviewed By: pdillinger

Differential Revision: D27954339

Pulled By: mrambacher

fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
2021-04-22 20:43:54 -07:00
Andrew Gallagher
2e5de5a2c3 Cleanup include (#8208)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208

Make include of "file_system.h" use the same include path as everywhere
else.

Reviewed By: riversand963, akankshamahajan15

Differential Revision: D27881606

fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
2021-04-20 14:57:27 -07:00
Yanqin Jin
a376c22066 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 18:11:13 -07:00
Levi Tamasi
0c6e4674a6 Fix a data race related to DB properties (#8206)
Summary:
Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
`rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
which is not safe without synchronization. This resulted in data races with
memtable inserts. The patch changes the code handling these properties
to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
cached value backed by an atomic variable. Two test cases had to be updated
for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
increasing the value size used so each value ends up in its own memtable,
which was the original intention (note: the test has been broken in the sense
that the test code didn't consider that memtable sizes below 64 KB get
increased to 64 KB by `SanitizeOptions`, and has been passing only by
accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
directly instead of going through the DB properties. Note: this should be safe in this case
since there's only a single thread involved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D27866811

Pulled By: ltamasi

fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
2021-04-19 16:38:02 -07:00
Akanksha Mahajan
c377c2ba15 Fix flaky test BackupableDBTest.FileSizeForIncremental (#8197)
Summary:
Test was flaky because for kUseDbSessionId naming, blob files use
naming scheme kLegacyCrc32cAndFileSize. So expected number of files
because of collision can vary. So disabling blobdb for this test case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197

Reviewed By: pdillinger

Differential Revision: D27836997

Pulled By: akankshamahajan15

fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80
2021-04-18 16:18:35 -07:00