Commit Graph

687 Commits

Author SHA1 Message Date
sdong
a50ea71c00 Improve ldb consistency checks (#6802)
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802

Test Plan: Add a new unit test. Some manual tests with corrupted DBs.

Reviewed By: pdillinger

Differential Revision: D21388051

fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
2020-05-08 14:17:47 -07:00
anand76
94265234de Fix race due to delete triggered compaction in Universal compaction mode (#6799)
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.

Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799

Reviewed By: pdillinger

Differential Revision: D21431286

Pulled By: anand1976

fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
2020-05-07 17:32:17 -07:00
Yanqin Jin
5584595f80 Do not swallow error returned from SaveTo() (#6801)
Summary:
With consistency check enabled, VersionBuilder::SaveTo() may return error once
corruption is detected while building versions. We should handle these errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6801

Test Plan: make check

Reviewed By: siying

Differential Revision: D21385045

Pulled By: riversand963

fbshipit-source-id: 98f6424e2a4699b62befa21e9fe00e70a771118e
2020-05-05 10:46:20 -07:00
Levi Tamasi
a00ddf1574 Expose the set of live blob files from Version/VersionSet (#6785)
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21336210

Pulled By: ltamasi

fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
2020-05-04 15:08:13 -07:00
sdong
680c416348 Avoid Swallowing Some File Consistency Checking Bugs (#6793)
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.

Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793

Test Plan: Add a unit test to cover the reopen case.

Reviewed By: riversand963

Differential Revision: D21366525

fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
2020-05-04 14:18:11 -07:00
Levi Tamasi
fe238e5438 Keep track of obsolete blob files in VersionSet (#6755)
Summary:
The patch adds logic to keep track of obsolete blob files. A blob file becomes
obsolete when the last `shared_ptr` that points to the corresponding
`SharedBlobFileMetaData` object goes away, which, in turn, happens when the
last `Version` that contains the blob file is destroyed. No longer needed blob
files are added to the obsolete list in `VersionSet` using a custom deleter to
avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`.
Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored
in `JobContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21233155

Pulled By: ltamasi

fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a
2020-04-30 11:25:51 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Yanqin Jin
0c05624d50 Compaction with timestamp: input boundaries (#6645)
Summary:
Towards making compaction logic compatible with user timestamp.
When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp.

Test plan (devserver):
```
make check
```
Several individual tests:
```
./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
./db_with_timestamp_compaction_test
./db_with_timestamp_basic_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645

Reviewed By: ltamasi

Differential Revision: D20960012

Pulled By: riversand963

fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
2020-04-10 16:05:49 -07:00
Levi Tamasi
2165c3bacc Re-persist blob file metadata when a new manifest file is created (#6630)
Summary:
Does what it says on the can. Similarly to table files, we need to re-persist
the metadata of live blob files whenever a new manifest file is opened.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6630

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20802126

Pulled By: ltamasi

fbshipit-source-id: 5738692d898790293bf09d66e9997369bbf89566
2020-04-02 11:53:05 -07:00
Zhichao Cao
4246888101 Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 16:04:43 -07:00
Levi Tamasi
6f62322fe4 Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.

In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20656803

Pulled By: ltamasi

fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
2020-03-26 18:51:53 -07:00
Huisheng Liu
a6ce5c823b multiget support for timestamps (#6483)
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.

MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
  multireadrandom :     104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
  multireadrandom :     104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)

.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483

Reviewed By: anand1976

Differential Revision: D20498373

Pulled By: riversand963

fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
2020-03-24 11:24:09 -07:00
sdong
921cdd37e2 Fix bug that number of table loading threads is set as a boolean (#6576)
Summary:
When applying a new version in non DB open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the indented value in the first place, but it value 1 makes sense here, which would create no extra threads. This bug is not expected to cause user visible problems, assuming C++ implicitly cast bool to 0 or 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576

Test Plan: Run all exsiting test.

Reviewed By: ajkr

Differential Revision: D20602467

fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737
2020-03-24 10:17:40 -07:00
anand76
a9d168cfd7 Simplify migration to FileSystem API (#6552)
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.

This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
  has some stray calls to Env APIs rather than FileSystem, they will go
  through the same object and thus there is no risk of getting out of
  sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```

Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552

Reviewed By: riversand963

Differential Revision: D20592038

Pulled By: anand1976

fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
2020-03-23 21:54:21 -07:00
Zhichao Cao
5c6346c420 Revert "Added the safe-to-ignore tag to version_edit (#6530)" (#6569)
Summary:
This reverts commit e10553f2a6.

Pass make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6569

Reviewed By: riversand963

Differential Revision: D20574319

Pulled By: zhichao-cao

fbshipit-source-id: ce36981a21596f5f2e14da6a59a2bb3619509a8b
2020-03-23 10:27:47 -07:00
Yanqin Jin
fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
Zhichao Cao
e10553f2a6 Added the safe-to-ignore tag to version_edit (#6530)
Summary:
Each time RocksDB switches to a new MANIFEST file from old one, it calls WriteCurrentStateToManifest() which writes a 'snapshot' of the current in-memory state of versions to the beginning of the new manifest as a bunch of version edits. We can distinguish these version edits from other version edits written during normal operations with a custom, safe-to-ignore tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6530

Test Plan: added test to version_edit_test, pass make asan_check

Reviewed By: riversand963

Differential Revision: D20524516

Pulled By: zhichao-cao

fbshipit-source-id: f1de102f5499bfa88dae3caa2f32c7f42cf904db
2020-03-19 11:30:26 -07:00
Yanqin Jin
d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Zhichao Cao
8d73137ae8 Replace Directory with FSDirectory in DB (#6468)
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761  introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468

Test Plan: pass make asan_check

Differential Revision: D20195261

Pulled By: zhichao-cao

fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
2020-03-02 16:16:26 -08:00
Huisheng Liu
904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
Yanqin Jin
362b8d4393 Fix MANIFEST name assignment (#6426)
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.

Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426

Test Plan: make check

Differential Revision: D19951866

Pulled By: riversand963

fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
2020-02-20 14:30:58 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Cheng Chang
4034e289ad Fail fast in paranoid mode when LoadTableHandlers fail during recovering (#6368)
Summary:
Previously, when recovering version set, LoadTableHandlers failures are ignored.
If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368

Test Plan: make check

Differential Revision: D19713459

Pulled By: cheng-chang

fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55
2020-02-14 08:17:10 -08:00
Cheng Chang
a676001f95 Revert usage of Defer. (#6410)
Summary:
Seems like this caused the following test failure on AppVeyor:
DBTest2.CrashInRecoveryMultipleCF
c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options)
IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410

Test Plan: Wait to see whether the AppVeyor test passes.

Differential Revision: D19879872

Pulled By: cheng-chang

fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26
2020-02-13 10:31:32 -08:00
anand76
3e49249d30 Ensure all MultiGet IO errors are propagated to user (#6403)
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403

Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure

Differential Revision: D19846721

Pulled By: anand1976

fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
2020-02-11 17:27:22 -08:00
anand76
35ed530d2c Revert "Check KeyContext status in MultiGet (#6387)" (#6401)
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401

Differential Revision: D19826623

Pulled By: anand1976

fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
2020-02-10 22:23:36 -08:00
Cheng Chang
dafb568052 Add utility class Defer (#6382)
Summary:
Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope.
Used in VersionSet:: ProcessManifestWrites as an example.
The inline comments for class `Defer` have more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382

Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test`

Differential Revision: D19797538

Pulled By: cheng-chang

fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f
2020-02-10 17:59:47 -08:00
Levi Tamasi
cbf5f3be43 Do not move VersionEdit into AtomicGroupReadBuffer (#6400)
Summary:
https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with
`VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit`
(which was added in https://github.com/facebook/rocksdb/pull/5411):
`AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`,
however, the client `VersionSet` classes keep using it afterwards. This
*seemed to* work before the refactoring but it really did not: since
`VersionEdit` used to have a user-declared destructor, no move
constructor/move assignment operator was generated, and the `move` in
`AddEdit` was really a copy. The patch makes the copy explicit. Note: it
should be possible to rework this logic so that we can get away
with the move but for now, this should fix the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400

Test Plan:
`make check`
`make analyze`

Differential Revision: D19824466

Pulled By: ltamasi

fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4
2020-02-10 17:15:42 -08:00
Zhichao Cao
4369f2c7bb Checksum for each SST file and stores in MANIFEST (#6216)
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.

Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216

Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.

Differential Revision: D19171461

Pulled By: zhichao-cao

fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
2020-02-10 15:52:52 -08:00
anand76
d70011bccc Check KeyContext status in MultiGet (#6387)
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387

Test Plan: Add unit tests

Differential Revision: D19799819

Pulled By: anand1976

fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
2020-02-07 16:48:16 -08:00
Levi Tamasi
752c87af78 Clean up VersionEdit a bit (#6383)
Summary:
This is a bunch of small improvements to `VersionEdit`. Namely, the patch

* Makes the names and order of variables, methods, and code chunks related
  to the various information elements more consistent, and adds missing
  getters for the sake of completeness.
* Initializes previously uninitialized stack variables.
* Marks all getters const to improve const correctness.
* Adds in-class initializers and removes the default ctor that would
  create an object with uninitialized built-in fields and call `Clear`
  afterwards.
* Adds a new type alias for new files and changes the existing `typedef`
  for deleted files into a type alias as well.
* Makes the helper method `DecodeNewFile4From` private.
* Switches from long-winded iterator syntax to range based loops in a
  couple of places.
* Fixes a couple of assignments where an integer 0 was assigned to
  boolean members.
* Fixes a getter which used to return a `const std::string` instead of
the intended `const std::string&`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383

Test Plan: make check

Differential Revision: D19780537

Pulled By: ltamasi

fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0
2020-02-07 13:27:06 -08:00
sdong
69c8614815 Avoid to get manifest file size when recovering from it. (#6369)
Summary:
Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369

Test Plan: Run all existing tests

Differential Revision: D19714872

fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d
2020-02-04 11:39:23 -08:00
anand76
fb05b5a652 Force a new manifest file if append to current one fails (#6331)
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316

When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flush and compactions, as well as foreground operatons such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.

To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331

Test Plan: Add new unit tests in error_handler_test.cc

Differential Revision: D19632951

Pulled By: anand1976

fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
2020-01-30 10:56:29 -08:00
sdong
8f2bee6747 Add ReadOptions.auto_prefix_mode (#6314)
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314

Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.

Differential Revision: D19458717

fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
2020-01-28 14:44:05 -08:00
sdong
f10f135938 Fix regression bug of hash index with iterator total order seek (#6328)
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked

Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328

Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.

Differential Revision: D19586751

fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
2020-01-27 15:44:54 -08:00
Levi Tamasi
d305f13e21 Make DBCompactionTest.SkipStatsUpdateTest more robust (#6306)
Summary:
Currently, this test case tries to infer whether
`VersionStorageInfo::UpdateAccumulatedStats` was called during open by
checking the number of files opened against an arbitrary threshold (10).
This makes the test brittle and results in sporadic failures. The patch
changes the test case to use sync points to directly test whether
`UpdateAccumulatedStats` was called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306

Test Plan: `make check`

Differential Revision: D19439544

Pulled By: ltamasi

fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f
2020-01-21 12:55:55 -08:00
Yanqin Jin
1aaa145877 Fix a data race for cfd->log_number_ (#6249)
Summary:
A thread calling LogAndApply may release db mutex when calling
WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can
call SwitchMemtable() and writes to cfd->log_number_.
Solution is to cache the cfd->log_number_ before releasing mutex in
LogAndApply.

Test Plan (on devserver):
```
$COMPILE_WITH_TSAN=1 make db_stress
$./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0  --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
Then repeat the following multiple times, e.g. 100 after compiling with tsan.
```
$./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249

Differential Revision: D19235077

Pulled By: riversand963

fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173
2020-01-06 20:09:51 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
Jermy Li
c2029f9716 Support concurrent CF iteration and drop (#6147)
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147

Differential Revision: D18926378

fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
2019-12-12 19:04:48 -08:00
sdong
aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
sdong
77eab5c85a Make default value of options.ttl to be 30 days when it is supported. (#6073)
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.

Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.

Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073

Test Plan: Add unit tests for it.

Differential Revision: D18669626

fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
2019-11-26 10:00:32 -08:00
sdong
d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
sdong
bb23bfe63c Fix a regression bug on total order seek with prefix enabled and range delete (#6028)
Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028

Test Plan: Add a unit test which fails without the fix.

Differential Revision: D18479063

fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
2019-11-13 10:11:34 -08:00
Levi Tamasi
f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
Sagar Vemuri
4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Vijay Nadimpalli
ec880436c1 API to get file_creation_time of the oldest file in the DB (#5948)
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948

Test Plan: Added unit test.

Differential Revision: D18056151

Pulled By: vjnadimpalli

fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
2019-10-25 11:53:57 -07:00
Yanqin Jin
2309fd63bf Update column families' log number altogether after flushing during recovery (#5856)
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.

Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.

Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856

Differential Revision: D17620818

Pulled By: riversand963

fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
2019-10-24 18:29:30 -07:00
sdong
a0cd920026 LevelIterator to avoid gap after prefix bloom filters out a file (#5861)
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861

Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test

Differential Revision: D17918213

fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
2019-10-21 11:40:57 -07:00
Levi Tamasi
5f025ea832 BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903)
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903

Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.

Differential Revision: D17859997

Pulled By: ltamasi

fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
2019-10-14 15:21:01 -07:00
sdong
846e05005d Revert "Merging iterator to avoid child iterator reseek for some cases (#5286)" (#5871)
Summary:
This reverts commit 9fad3e21eb.

Iterator verification in stress tests sometimes fail for assertion
table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound || user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed.

It is likely to be linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468 as the former PR makes some child iterator's seek being avoided, so that upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels a more important improvement so I choose to revert the former one for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871

Differential Revision: D17689196

fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c
2019-10-01 11:22:41 -07:00