Commit Graph

1075 Commits

Author SHA1 Message Date
Yanqin Jin
d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Cheng Chang
0a0151fb99 Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455)
Summary:
In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455

Test Plan: make check

Differential Revision: D20106753

Pulled By: cheng-chang

fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58
2020-03-06 14:05:12 -08:00
Otto Kekäläinen
f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
Huisheng Liu
904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
Michael R. Crusoe
051696bf98 fix some spelling typos (#6464)
Summary:
Found from Debian's "Lintian" program
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464

Differential Revision: D20162862

Pulled By: zhichao-cao

fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12
2020-02-28 14:14:03 -08:00
Manuel Ung
41535d0218 WriteUnPrepared: Pass in correct subbatch count during rollback (#6463)
Summary:
Today `WriteUnpreparedTxn::RollbackInternal` will write the rollback batch assuming that there is only a single subbatch. However, because untracked_keys_ are currently not deduplicated, it's possible for duplicate keys to exist, and thus split the batch. Also, tracked_keys_ also does not support compators outside of the bytewise comparators, so it's possible for duplicates to occur there as well.

To solve this, just pass in the correct subbatch count.

Also, removed `WriteUnpreparedRollbackPreReleaseCallback` to unify the Commit/Rollback codepaths some more.

Also, fixed a bug in `CommitInternal` where if 1. two_write_queue is true and 2. include_data is true, then `WriteUnpreparedCommitEntryPreReleaseCallback` ends up calling `AddCommitted` on the commit time write batch a second time on the second write. To fix, `WriteUnpreparedCommitEntryPreReleaseCallback` is re-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6463

Differential Revision: D20150153

Pulled By: lth

fbshipit-source-id: df0b42d39406c75af73df995aa1138f0db539cd1
2020-02-28 11:19:32 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Andrew Kryczka
0f9dcb88b2 Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393)
Summary:
As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393

Test Plan: added unit test

Differential Revision: D19912360

Pulled By: ajkr

fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb
2020-02-18 11:18:25 -08:00
Manuel Ung
dc23c125c3 WriteUnPrepared: Untracked keys (#6404)
Summary:
For write unprepared, some applications may bypass the transaction api, and write keys directly into the write batch. However, since they are not tracked, rollbacks (both for savepoint and transaction) are not aware that these keys have to be rolled back.

The fix is to track them in `WriteUnpreparedTxn::untracked_keys_`. This is populated whenever we flush unprepared batches into the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6404

Differential Revision: D19842023

Pulled By: lth

fbshipit-source-id: a9edfc643d5c905fc89da9a9a9094d30c9b70108
2020-02-14 11:31:39 -08:00
wolfkdy
29e24434fe refine code (#6420)
Summary:
I create a new branch from the branch new upsteram/master and "git merge --squash".
Maybe it will fix everything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6420

Differential Revision: D19897152

Pulled By: zhichao-cao

fbshipit-source-id: 6575d9e3b23e360f42ee1480b43028b5fcc20136
2020-02-13 18:55:02 -08:00
Manuel Ung
fb571509a7 WriteUnPrepared: Enable WAL during crash recovery (#6418)
Summary:
Unfortunately, it seems like mysqld reuses xids across machine restarts. When that happens, we could have something like the following happening:

```
BEGIN_PREPARE(unprepared) Put(a) END_PREPARE(xid = 1)
-- crash and recover with Put(a) rolled back as it was not prepared
BEGIN_PREPARE(prepared) Put(b) END_PREPARE(xid = 1)
COMMIT(xid = 1)
-- crash and recover with both a, b
```

To solve this, we will have to log the rollback batch into the WAL during recovery.

WritePrepared already logs the rollback batch into the WAL, if a rollback happens after prepare, so there is no problem there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6418

Differential Revision: D19896151

Pulled By: lth

fbshipit-source-id: 2ff65ddc5fe75efd57736fed4b7cd7a109d26609
2020-02-13 18:44:39 -08:00
sdong
ac8e89a443 Should flush and sync WAL when writing it in DB::Open() (#6417)
Summary:
A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417

Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix.

Differential Revision: D19894537

fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e
2020-02-13 18:41:04 -08:00
Kefu Chai
debc4ef18b utilities/env_librados: copy use bufferlist::iterator (#6395)
Summary:
to adapt the change in ceph upstream where the bufferlist::copy() method
was removed in
c724369010

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6395

Differential Revision: D19816815

Pulled By: zhichao-cao

fbshipit-source-id: 9210767b91af0ecdcf5dfaa3e70edcaeea55135f
2020-02-10 11:31:16 -08:00
sdong
876c2dbff4 Allow readahead when reading option files. (#6372)
Summary:
Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372

Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.

Differential Revision: D19727739

fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
2020-02-07 15:18:26 -08:00
Levi Tamasi
1b4be4cac9 BlobDB: ignore trivially moved files when updating the SST<->blob file mapping (#6381)
Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the `OnFlushCompleted` and `OnCompactionCompleted` callbacks of
the `EventListener` interface: upon receiving a flush notification, a link
is added between the newly flushed SST and the corresponding blob file;
for compactions, links are removed for the inputs and added for the outputs.
The earlier code performed this link deletion and addition even for
trivially moved files; the new code walks through the two lists together
(in a fashion that's similar to merge sort) and skips such files.
This should mitigate https://github.com/facebook/rocksdb/issues/6338,
wherein an assertion is triggered with the earlier code when a compaction
notification for a trivial move precedes the flush notification for the
moved SST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6381

Test Plan: make check

Differential Revision: D19773729

Pulled By: ltamasi

fbshipit-source-id: ae0f273ded061110dd9334e8fb99b0d7786650b0
2020-02-07 12:50:57 -08:00
Yanqin Jin
f2fbc5d668 Shorten certain test names to avoid infra failure (#6352)
Summary:
Unit test names, together with other components,  are used to create log files
during some internal testing. Overly long names cause infra failure due to file
names being too long.

Look for internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352

Differential Revision: D19649307

Pulled By: riversand963

fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7
2020-01-30 23:10:24 -08:00
Levi Tamasi
9e3ace42a4 Add statistics for BlobDB GC (#6296)
Summary:
The patch adds statistics support to the new BlobDB garbage collection implementation;
namely, it adds support for the following (pre-existing) tickers:

`BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic.
`BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic.
`BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is
equivalent to a (sub)compaction).
`BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob
files by the GC logic.
`BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files.

The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`,
`BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and
`BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked
deprecated.

The patch also adds a couple of log messages that log the number and total size of
blobs encountered and relocated during a GC pass, as well as the number of blob
files created and obsoleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296

Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`.

Differential Revision: D19402513

Pulled By: ltamasi

fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec
2020-01-29 16:46:16 -08:00
Maysam Yabandeh
2f973ca96e Double Crash in kPointInTimeRecovery with TransactionDB (#6313)
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313

Differential Revision: D19458291

Pulled By: maysamyabandeh

fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
2020-01-29 11:40:55 -08:00
matthewvon
e6e8b9e871 Correct pragma once problem with Bazel on Windows (#6321)
Summary:
This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else.

The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue.

But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321

Differential Revision: D19506585

Pulled By: ltamasi

fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a
2020-01-21 16:12:43 -08:00
Levi Tamasi
1dd7873e08 Remove earlier partial BlobDB GC implementation (#6278)
Summary:
In addition to removing the earlier partially implemented garbage collection
logic from the BlobDB codebase, the patch also removes the test cases (as well as
the related sync points, as appropriate) that were only relevant for the old
implementation, and reworks the remaining ones so they use the new GC logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6278

Test Plan: `make check`

Differential Revision: D19335226

Pulled By: ltamasi

fbshipit-source-id: 0cc1794bc9892feda1426ed5522a318f3cb1b692
2020-01-14 15:08:44 -08:00
Maysam Yabandeh
eff5e076f5 unordered_write incompatible with max_successive_merges (#6284)
Summary:
unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions.
The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284

Differential Revision: D19356115

Pulled By: maysamyabandeh

fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6
2020-01-10 16:53:19 -08:00
sdong
39410bcb3d Fix some shadow warning (#6242)
Summary:
Some shadow warning shows up when using gcc 4.8. An example:

./utilities/blob_db/blob_compaction_filter.h: In constructor ‘rocksdb::blob_db::BlobIndexCompactionFilterFactoryBase::BlobIndexCompactionFilterFactoryBase(rocksdb::blob_db::lobDBImpl*, rocksdb::Env*, rocksdb::Statistics*)’:
./utilities/blob_db/blob_compaction_filter.h:121:7: error: declaration of ‘blob_db_impl’ shadows a member of 'this' [-Werror=shadow]
       : blob_db_impl_(blob_db_impl), env_(_env), statistics_(_statistics) {}
       ^

Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6242

Test Plan: Build and see the warnings go away.

Differential Revision: D19217789

fbshipit-source-id: 8ef631941f23dab47a388e060adec24b72efd65e
2020-01-08 18:20:13 -08:00
Maysam Yabandeh
5709e97a74 Skip CancelAllBackgroundWork if DBImpl is already closed (#6268)
Summary:
WritePreparedTxnDB calls CancelAllBackgroundWork in its destructor to avoid dangling references to it from background job's SnapshotChecker callback. However, if the DBImpl is already closed, the info log might be closed with it, which causes memory leak when CancelAllBackgroundWork tries to print to the info log. The patch fixes that by calling CancelAllBackgroundWork only if the db is not closed already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6268

Differential Revision: D19303439

Pulled By: maysamyabandeh

fbshipit-source-id: 4228a6be7e78d43c90630347baa89b008200bd15
2020-01-07 15:34:27 -08:00
wolfkdy
1ab1231acf parallel occ (#6240)
Summary:
This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files
I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5)
1) add validation modes for optimistic txns
2) modify unittests to test both modes
3) make format
4) refine hash functor
5) push to master
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240

Differential Revision: D19301296

fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb
2020-01-07 14:20:38 -08:00
Maysam Yabandeh
48a678b7c9 Prevent an incompatible combination of options (#6254)
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254

Differential Revision: D19265819

Pulled By: maysamyabandeh

fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
2020-01-02 16:15:06 -08:00
Levi Tamasi
1ebaa762e6 Log garbage_collection_cutoff alongside the other BlobDB options
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6229

Differential Revision: D19191195

Pulled By: ltamasi

fbshipit-source-id: 2a3c4785299641a46e022fc012460b759a689fce
2019-12-20 11:00:53 -08:00
Levi Tamasi
7a7ca8eb5b BlobDB: only compare CF IDs when checking whether an API call is for the default CF (#6226)
Summary:
BlobDB currently only supports using the default column family. The earlier
code enforces this by comparing the `ColumnFamilyHandle` passed to the
`Get`/`Put`/etc. call with the handle returned by `DefaultColumnFamily`
(which, at the end of the day, comes from `DBImpl::default_cf_handle_`).
Since other `ColumnFamilyHandle`s can also point to the default column
family, this can reject legitimate requests as well. (As an example,
with the earlier code, the handle returned by `BlobDB::Open` cannot
actually be used in API calls.) The patch fixes this by comparing only
the IDs of the column family handles instead of the pointers themselves.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6226

Test Plan: `make check`

Differential Revision: D19187461

Pulled By: ltamasi

fbshipit-source-id: 54ce2e12ebb1f07e6d1e70e3b1e0213dfa94bda2
2019-12-19 18:05:49 -08:00
anand1976
1be48cb895 Fix crash in Transaction::MultiGet() when num_keys > 32
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192

Test Plan:
Add a unit test that fails without the fix and passes now
make check

Differential Revision: D19124781

Pulled By: anand1976

fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783
2019-12-16 20:39:35 -08:00
Levi Tamasi
0d2172f128 Make it possible to enable periodic compactions for BlobDB (#6172)
Summary:
Periodic compactions ensure that even SSTs that do not get picked up
otherwise eventually go through compaction; used in conjunction with
BlobDB's garbage collection, they enable BlobDB to reclaim space when
old blob files are used by such straggling SSTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6172

Test Plan: Ran `make check` and used the BlobDB mode of `db_bench`.

Differential Revision: D19045045

Pulled By: ltamasi

fbshipit-source-id: 04636ecc4b6cfe8d495bf656faa65d54a5eb1a93
2019-12-13 16:13:25 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
Levi Tamasi
583c6953d8 Move out valid blobs from the oldest blob files during compaction (#6121)
Summary:
The patch adds logic that relocates live blobs from the oldest N non-TTL
blob files as they are encountered during compaction (assuming the BlobDB
configuration option `enable_garbage_collection` is `true`), where N is defined
as the number of immutable non-TTL blob files multiplied by the value of
a new BlobDB configuration option called `garbage_collection_cutoff`.
(The default value of this parameter is 0.25, that is, by default the valid blobs
residing in the oldest 25% of immutable non-TTL blob files are relocated.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121

Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18785357

Pulled By: ltamasi

fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e
2019-12-13 10:13:05 -08:00
Levi Tamasi
3b607610df Do not update SST <-> blob file mapping if compaction failed
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6156

Test Plan: Extended unit tests.

Differential Revision: D18943867

Pulled By: ltamasi

fbshipit-source-id: b3669d2dd6af08e987ad1a59d6712ae2514da0b1
2019-12-12 11:30:45 -08:00
Adam Retter
a61ec9ae3b Fix BlobDB compilation on older GCC versions
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6094

Differential Revision: D18731951

Pulled By: ltamasi

fbshipit-source-id: 5b73c6009c748f6a2a48d4d880b1259980d801d4
2019-11-27 13:09:09 -08:00
Adam Retter
6d58ea901d Fix compilation under MSVC VS2015 (#6081)
Summary:
**NOTE**: this also needs to be back-ported to 6.4.6 and possibly older branches if further releases from them is envisaged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6081

Differential Revision: D18710107

Pulled By: zhichao-cao

fbshipit-source-id: 03260f9316566e2bfc12c7d702d6338bb7941e01
2019-11-26 18:24:09 -08:00
Levi Tamasi
d9314a9214 Refactor and clean up the code that reads a blob from a file (#6093)
Summary:
This patch factors out the logic that reads a (potentially compressed) blob
from a file into a separate helper method `GetRawBlobFromFile`, and cleans
up the code a bit. Also, errors during decompression are now logged/propagated
to the user by returning a `Status` code of `Corruption`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6093

Test Plan: `make check`

Differential Revision: D18716673

Pulled By: ltamasi

fbshipit-source-id: 44144bc064cab616862d5643f34384f2bae6eb78
2019-11-26 16:49:39 -08:00
Levi Tamasi
72daa92d3a Refactor blob file creation logic (#6066)
Summary:
The patch refactors and cleans up the logic around creating new blob files
by moving the common code of `SelectBlobFile` and `SelectBlobFileTTL`
to a new helper method `CreateBlobFileAndWriter`, bringing the implementation
of `SelectBlobFile` and `SelectBlobFileTTL` into sync, and increasing encapsulation
by adding new constructors for `BlobFile` and `BlobLogHeader`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6066

Test Plan:
Ran `make check` and used the BlobDB mode of `db_bench` to sanity test both
the TTL and the non-TTL code paths.

Differential Revision: D18646921

Pulled By: ltamasi

fbshipit-source-id: e5705a84807932e31dccab4f49b3e64369cea26d
2019-11-26 13:28:32 -08:00
Sebastiano Peluso
fcd7e03832 Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072)
Summary:
This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.

Issue: https://github.com/facebook/rocksdb/issues/4997

Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072

Reviewed By: pdillinger

Differential Revision: D18687364

Pulled By: sebastianopeluso

fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
2019-11-26 10:00:31 -08:00
Levi Tamasi
279c488395 Mark blob files not needed by any memtables/SSTs obsolete (#6032)
Summary:
The patch adds logic to mark no longer needed blob files obsolete upon database open
and whenever a flush or compaction completes. Unneeded blob files are detected by
iterating through live immutable non-TTL blob files starting from the lowest-numbered one,
and stopping when a blob file used by any SSTs or potentially used by memtables is found.
(The latter is determined by comparing the sequence number at which the blob file
became immutable with the largest sequence number received in flush notifications.)

In addition, the patch cleans up the logic around closing and obsoleting blob files and
enforces invariants around this area (blob files are now guaranteed to go through the
stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this
order).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032

Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18495610

Pulled By: ltamasi

fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961
2019-11-18 16:30:06 -08:00
Maysam Yabandeh
0058daef7b Disable SmallestUnCommittedSeq in Valgrind run (#6035)
Summary:
SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035

Differential Revision: D18509198

Pulled By: maysamyabandeh

fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230
2019-11-14 14:41:52 -08:00
Peter Dillinger
e8e7fb1dcf More fixes to auto-GarbageCollect in BackupEngine (#6023)
Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023

Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.

Differential Revision: D18453559

Pulled By: pdillinger

fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
2019-11-14 06:20:18 -08:00
anand76
6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
Levi Tamasi
8e7aa62813 BlobDB: Maintain mapping between blob files and SSTs (#6020)
Summary:
The patch adds logic to BlobDB to maintain the mapping between blob files
and SSTs for which the blob file in question is the oldest blob file referenced
by the SST file. The mapping is initialized during database open based on the
information retrieved using `GetLiveFilesMetaData`, and updated after
flushes/compactions based on the information received through the `EventListener`
interface (or, in the case of manual compactions issued through the `CompactFiles`
API, the `CompactionJobInfo` object).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020

Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`.

Differential Revision: D18410508

Pulled By: ltamasi

fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5
2019-11-11 14:01:34 -08:00
Peter Dillinger
aa63abf698 Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015)
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.

This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.

Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.

Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015

Test Plan: Updated unit tests

Differential Revision: D18401333

Pulled By: pdillinger

fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
2019-11-08 19:15:35 -08:00
Levi Tamasi
f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
Yanqin Jin
67e735dbf9 Rename BlockBasedTable::ReadMetaBlock (#6009)
Summary:
According to
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format,
the block read by BlockBasedTable::ReadMetaBlock is actually the meta index
block. Therefore, it is better to rename the function to ReadMetaIndexBlock.

This PR also applies some format change to existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009

Test Plan: make check

Differential Revision: D18333238

Pulled By: riversand963

fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed
2019-11-05 17:19:11 -08:00
Sergei Petrunia
230bcae7b6 Add a limited support for iteration bounds into BaseDeltaIterator (#5403)
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow

BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.

== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.

So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.

(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403

Differential Revision: D15863092

Pulled By: maysamyabandeh

fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
2019-11-05 11:39:36 -08:00
Maysam Yabandeh
52733b4498 WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850)
Summary:
MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses max sequence number after that, which could have advanced the snapshot by then, thus making the test flaky.
The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850

Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="*MaxCatchupWithNewSnapshot*"

Differential Revision: D17608926

Pulled By: maysamyabandeh

fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136
2019-11-04 16:23:57 -08:00
Sagar Vemuri
4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Peter Dillinger
ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
Levi Tamasi
a59dc843a4 Move blob_index.h to db/ (#5919)
Summary:
Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919

Test Plan: make check

Differential Revision: D17910132

Pulled By: ltamasi

fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5
2019-10-14 12:54:05 -07:00