Commit Graph

3251 Commits

Author SHA1 Message Date
Yi Wu
4cb7068c1e BlobDB: Fix VisibleToActiveSnapshot() (#4236)
Summary:
There are two issues with `VisibleToActiveSnapshot`:
1. If there are no snapshots, `oldest_snapshot` will be 0 and `VisibleToActiveSnapshot` will always return true. Since the method is used to decide whether it is safe to delete obsolete files, obsolete file won't be able to delete in this case.
2. The `auto` keyword of `auto snapshots = db_impl_->snapshots()` translate to a copy of `const SnapshotList` instead of a reference. Since copy constructor of `SnapshotList` is not defined, using the copy may yield unexpected result.

Issue 2 actually hide issue 1 from being catch by tests. During test `snapshots.empty()` can return false while it should actually be empty, and `snapshots.oldest()` return an invalid address, making `oldest_snapshot` being some random large number.

The issue was originally reported by BlobDB early adopter at Kuaishou.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4236

Differential Revision: D9188706

Pulled By: yiwu-arbug

fbshipit-source-id: a0f2624b927cf9bf28c1bb534784fee5d106f5ea
2018-08-06 16:57:42 -07:00
Jingguo Yao
ceb5fea1e3 Improve FullFilterBitsReader::HashMayMatch's doc (#4202)
Summary:
HashMayMatch is related to AddKey() instead of CreateFilter().
Also applies some minor Fixes #4191 #4200 #3910
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202

Differential Revision: D9180945

Pulled By: maysamyabandeh

fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
2018-08-06 11:13:18 -07:00
Yanqin Jin
1f802773bc Update JobContext. (#3949)
Summary:
In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`.
Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature.

Add a vector of `SuperVersionContext` to `JobContext` to support installing
superversions for multiple column families in one job context.

This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949

Differential Revision: D8864895

Pulled By: riversand963

fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca
2018-08-03 17:42:34 -07:00
Yanqin Jin
22368965a0 Modify verification logic of ObsoleteOptionsFileTest (#4218)
Summary:
The current verification logic does not consider the case in which multiple
threads (foreground and background) may execute `PurgeObsoleteFiles` function
simultaneously. Each invocation will trigger the callback adding elements to
a vector. Then we verify the elements in the vector, which can fail sometimes.

The solution is to give up checking the elements. Instead, we check the number
of OPTIONS file in the database dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4218

Differential Revision: D9128727

Pulled By: riversand963

fbshipit-source-id: 2b13b705fb21bc0ddd41940c4ec9b6b0c8d88224
2018-08-03 13:57:40 -07:00
DorianZheng
f9373e2d5c Make sure to call ReleaseFileNumberFromPendingOutputs
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4219

Differential Revision: D9144294

Pulled By: riversand963

fbshipit-source-id: e46b72e5f8a149dc7a0512e38edcd0ddb0150f30
2018-08-02 18:57:34 -07:00
Andrew Kryczka
f8f6983f89 Skip range deletions at seqno zero when collapsing (#4216)
Summary:
`CollapsedRangeDelMap` internally uses seqno zero as a sentinel value to
denote a gap between range tombstones or the end of range tombstones. It
therefore expects to never have consecutive sentinel tombstones.

However, since `DeleteRange` is now supported in `SstFileWriter`, an
ingested file may contain range tombstones, and that ingested file may
be assigned global seqno zero. When such tombstones are added to the
collapsed map, they resemble sentinel tombstones due to having seqno
zero. Then, the invariant mentioned above about never having consecutive
sentinel tombstones can be violated.

The symptom of this violation was dereferencing the `end()` iterator
(#4204). The fix in this PR is to not add range tombstones with seqno
zero to the collapsed map. They're not needed anyways since they can't
possibly cover anything (in case of a key and a range tombstone with the
same seqno, the key is visible).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4216

Differential Revision: D9121716

Pulled By: ajkr

fbshipit-source-id: f5b78a70bea9527354603ea7ac8542a7e2b6a210
2018-08-01 12:12:02 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Yanqin Jin
54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
DorianZheng
f5e46354d2 Protect external file when ingesting (#4099)
Summary:
If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099

Differential Revision: D9034092

Pulled By: riversand963

fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28
2018-07-27 14:13:12 -07:00
Siying Dong
fd45495cf5 DBImpl::IngestExternalFile() should grab mutex when releasing file number in failure case (#4189)
Summary:
995fcf7573 has a bug: ReleaseFileNumberFromPendingOutputs() added is not protected by the DB mutex. Fix it by grabbing the lock for this operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4189

Differential Revision: D9015447

Pulled By: siying

fbshipit-source-id: b8506e09a96c3f95a6fe32b5ca5fcdb9bee88937
2018-07-26 11:12:29 -07:00
Siying Dong
2a81633da2 Fix bug when seeking backward against an out-of-bound iterator (#4187)
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187

Differential Revision: D8996600

Pulled By: siying

fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
2018-07-25 17:14:01 -07:00
Manuel Ung
ea212e5316 WriteUnPrepared: Implement unprepared batches for transactions (#4104)
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.

Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.

A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104

Differential Revision: D8785717

Pulled By: lth

fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
2018-07-24 00:13:18 -07:00
Zhongyi Xie
f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
Siying Dong
a5e851e113 Reformatting some recent changes (#4161)
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161

Differential Revision: D8940582

Pulled By: siying

fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
2018-07-20 14:43:38 -07:00
Siying Dong
8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Yanqin Jin
2736752b33 Fix a bug in MANIFEST group commit (#4157)
Summary:
PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
implementation has a bug. When updating the log file number of each column
family, we must consider only `VersionEdit`s that operate on the same column
family. Otherwise, a column family may accidentally set its log file number
higher than actual value, indicating that log files with smaller file number
will be ignored, thus causing some updates to be lost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157

Differential Revision: D8916650

Pulled By: riversand963

fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
2018-07-19 17:27:56 -07:00
Dmitri Smirnov
78ab11cd71 Return new operator for Status allocations for Windows (#4128)
Summary: Windows requires new/delete for memory allocations to be overriden. Refactor to be less intrusive.

Differential Revision: D8878047

Pulled By: siying

fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36
2018-07-19 15:09:06 -07:00
Sagar Vemuri
f3801528c1 Disable DBFlushTest.SyncFail and DBTest.GroupCommitTest on Travis (#4154)
Summary:
I am temporarily disabling DBFlushTest.SyncFail and DBTest.GroupCommitTest tests on Travis until we figure out the root-cause. These tests will still continue to run locally though.
I haven't been able to reproduce these failures locally so far (even on a [local Travis environment](https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image) ).

These tests  are failing way too frequently causing everyone to wonder why their PR failed on travis, and waste time in debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4154

Differential Revision: D8907258

Pulled By: sagar0

fbshipit-source-id: f40068b16e9245fb3791b6a4796435d1ce1ed205
2018-07-18 18:43:11 -07:00
Siying Dong
37e0fdc824 DBSSTTest.DeleteSchedulerMultipleDBPaths data race (#4146)
Summary:
Fix a minor data race in DBSSTTest.DeleteSchedulerMultipleDBPaths reported by TSAN
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4146

Differential Revision: D8880945

Pulled By: siying

fbshipit-source-id: 25c632f685757735c59ad4ff26b2f346a443a446
2018-07-17 17:57:46 -07:00
Yi Wu
d538ebdff0 Fix write get stuck when pipelined write is enabled (#4143)
Summary:
Fix the issue when pipelined write is enabled, writers can get stuck indefinitely and not able to finish the write. It can show with the following example: Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).

T1: all writers pending in WAL writer queue:
WAL writer queue: W1, W2, W3, W4
memtable writer queue: empty

T2. W1 finish WAL writer and move to memtable writer queue:
WAL writer queue: W2, W3, W4,
memtable writer queue: W1

T3. W2 and W3 finish WAL write as a batch group. W2 enter ExitAsBatchGroupLeader and move the group to memtable writer queue, but before wake up next leader.
WAL writer queue: W4
memtable writer queue: W1, W2, W3

T4. W1, W2, W3 finish memtable write as a batch group. Note that W2 still in the previous ExitAsBatchGroupLeader, although W1 have done memtable write for W2.
WAL writer queue: W4
memtable writer queue: empty

T5. The thread corresponding to W3 create another writer W3' with the same address as W3.
WAL writer queue: W4, W3'
memtable writer queue: empty

T6. W2 continue with ExitAsBatchGroupLeader. Because the address of W3' is the same as W3, the last writer in its group, it thinks there are no pending writers, so it reset newest_writer_ to null, emptying the queue. W4 and W3' are deleted from the queue and will never be wake up.

The issue exists since pipelined write was introduced in 5.5.0.

Closes #3704
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143

Differential Revision: D8871599

Pulled By: yiwu-arbug

fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a
2018-07-17 17:27:51 -07:00
Siying Dong
ddc07b40fc Remove managed iterator
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124

Differential Revision: D8829910

Pulled By: siying

fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560
2018-07-17 14:43:18 -07:00
Siying Dong
995fcf7573 Pending output file number should be released after bulkload failure (#4145)
Summary:
If bulkload fails for an input error, the pending output file number wasn't released. This bug can cause all future files with larger number than the current number won't be deleted, even they are compacted. This commit fixes the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4145

Differential Revision: D8877900

Pulled By: siying

fbshipit-source-id: 080be92a23d43305ca1e13fe1c06eb4cd0b01466
2018-07-17 14:13:16 -07:00
Maysam Yabandeh
b55da012f6 Refactor IndexBlockIter (#4141)
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141

Differential Revision: D8866437

Pulled By: maysamyabandeh

fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
2018-07-16 17:13:10 -07:00
Sagar Vemuri
991120fa10 Allow ttl to be changed dynamically (#4133)
Summary:
Allow ttl to be changed dynamically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133

Differential Revision: D8845440

Pulled By: sagar0

fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3
2018-07-16 14:27:53 -07:00
Nathan VanBenschoten
ef7815b803 Support range deletion tombstones in IngestExternalFile SSTs (#3778)
Summary:
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
2018-07-13 22:43:09 -07:00
Peter Mattis
90fc40690a Relax VersionStorageInfo::GetOverlappingInputs check (#4050)
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050

Differential Revision: D8844423

Pulled By: ajkr

fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
2018-07-13 17:42:38 -07:00
Yanqin Jin
21171615c1 Reduce execution time of IngestFileWithGlobalSeqnoRandomized (#4131)
Summary:
Make `ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized` run faster.

`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4131

Differential Revision: D8839952

Pulled By: riversand963

fbshipit-source-id: 4a7e842fde1cde4dc902e928a1cf511322578521
2018-07-13 17:27:39 -07:00
Maysam Yabandeh
8581a93a6b Per-thread unique test db names (#4135)
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135

Differential Revision: D8846653

Pulled By: maysamyabandeh

fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
2018-07-13 17:27:39 -07:00
Fosco Marotto
8527012bb6 Converted db/merge_test.cc to use gtest (#4114)
Summary:
Picked up a task to convert this to use the gtest framework.  It can't be this simple, can it?

It works, but should all the std::cout be removed?

```
[$] ~/git/rocksdb [gft !]: ./merge_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MergeTest
[ RUN      ] MergeTest.MergeDbTest
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Test merge-operator not set after reopen
[       OK ] MergeTest.MergeDbTest (93 ms)
[ RUN      ] MergeTest.MergeDbTtlTest
Opening database with TTL
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
Opening database with TTL
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Opening database with TTL
Opening database with TTL
Opening database with TTL
Opening database with TTL
Test merge-operator not set after reopen
[       OK ] MergeTest.MergeDbTtlTest (97 ms)
[----------] 2 tests from MergeTest (190 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (190 ms total)
[  PASSED  ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114

Differential Revision: D8822886

Pulled By: gfosco

fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a
2018-07-13 14:13:07 -07:00
Anand Ananthabhotla
e3eba52a5d Re-enable kUniversalSubcompactions option_config (#4125)
Summary:
1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so
tests that cycle through all the option_configs include this
2. Skip kUniversalSubcompactions wherever kUniversalCompaction and
kUniversalCompactionMultilevel are skipped

Related to #3935
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125

Differential Revision: D8828637

Pulled By: anand1976

fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f
2018-07-13 11:13:01 -07:00
Tamir Duberstein
7bee48bdbd Add GCC 8 to Travis (#3433)
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433

Differential Revision: D6837948

Pulled By: sagar0

fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
2018-07-13 10:58:06 -07:00
Yanqin Jin
90ebf1a257 Reduce execution time of a test. (#4127)
Summary:
Reduce the number of key ranges in `ExternalSSTFileTest.OverlappingRanges` so
that the test completes in shorter time to avoid timeouts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4127

Differential Revision: D8827851

Pulled By: riversand963

fbshipit-source-id: a16387b0cc92a7c872b1c50f0cfbadc463afc9db
2018-07-12 17:42:03 -07:00
Yanqin Jin
dbeaa0d397 Reduce #iterations to shorten execution time. (#4123)
Summary:
Reduce #iterations from 5000 to 1000 so that
`ExternalSSTFileTest.CompactDuringAddFileRandom` can finish faster.
On the one hand, 5000 iterations does not seem to improve the quality of unit
test in comparison with 1000. On the other hand, long running tests should belong to stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4123

Differential Revision: D8822514

Pulled By: riversand963

fbshipit-source-id: 0f439b8d5ccd9a4aed84638f8bac16382de17245
2018-07-12 14:42:39 -07:00
Nikhil Benesch
5f3088d565 Range deletion performance improvements + cleanup (#4014)
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014

Differential Revision: D8773253

Pulled By: ajkr

fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
2018-07-12 14:42:39 -07:00
Nikhil Benesch
5cd8240b86 Test range deletions with more configurations (#4021)
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.

Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.

/cc ajkr nvanbenschoten

Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021

Differential Revision: D8811860

Pulled By: ajkr

fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
2018-07-11 15:57:49 -07:00
Yanqin Jin
331cb63641 SetOptions Backup Race Condition (#4108)
Summary:
Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below.
```
Time                  thread 1                           thread 2
  |   CreateNewBackup -> GetLiveFiles
  |                                         SetOptions -> RenameTempFileToOptionsFile
  |                                         SetOptions -> RenameTempFileToOptionsFile
  |                                         SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file
  |   copy the oldest OPTIONS // IO error!
  V
```
Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108

Differential Revision: D8796360

Pulled By: riversand963

fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82
2018-07-11 14:57:46 -07:00
Manuel Ung
b9846370e9 WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078)
Summary:
This adds support for recovering WriteUnprepared transactions through the following changes:
- The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
- `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
- `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.

Commit/Rollback of live transactions is still unimplemented and will come later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078

Differential Revision: D8703382

Pulled By: lth

fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
2018-07-06 17:59:13 -07:00
Yanqin Jin
db7ae0a485 Fix a map lookup that may throw exception. (#4098)
Summary:
`std::map::at(key)` throws std::out_of_range if key does not exist. Current
code does not handle this. Although this case is unlikely, I feel it's safe to
use `std::map::find`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098

Differential Revision: D8753865

Pulled By: riversand963

fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec
2018-07-06 16:12:49 -07:00
Huachao Huang
35b83327a7 compaction: fix max_subcompactions option for CompactRange (#4082)
Summary:
The max_subcompactions option was introduced in https://github.com/facebook/rocksdb/pull/3775.
Closes https://github.com/facebook/rocksdb/pull/4082

Differential Revision: D8743258

Pulled By: ajkr

fbshipit-source-id: d60ee75769dfc19ab6f8754e4ff3a267848f1ed9
2018-07-05 20:12:56 -07:00
Zhongyi Xie
b3efb1cbe0 fix clang analyzer warnings (#4072)
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
    } else if (meta->smallest.size() > 0) {
               ^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
    meta->marked_for_compaction = sub_compact->builder->NeedCompact();
    ~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
        uint32_t cf_id = last_writer->cfd->GetID();
                         ^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072

Differential Revision: D8685852

Pulled By: miasantreble

fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
2018-06-28 19:12:35 -07:00
Manuel Ung
8ad63a4b86 WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069)
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.

Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.

Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069

Differential Revision: D8675099

Pulled By: lth

fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
2018-06-28 18:58:29 -07:00
Anand Ananthabhotla
52d4c9b7f6 Allow DB resume after background errors (#3997)
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
2018-06-28 12:34:40 -07:00
Yanqin Jin
26d67e357e Support group commits of version edits (#3944)
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944

Differential Revision: D8432536

Pulled By: riversand963

fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
2018-06-28 12:34:39 -07:00
Maysam Yabandeh
0a5b5d88b2 Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070)
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070

Differential Revision: D8677858

Pulled By: maysamyabandeh

fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
2018-06-28 08:42:17 -07:00
Zhongyi Xie
14f409c0f1 PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067)
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067

Differential Revision: D8669002

Pulled By: miasantreble

fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
2018-06-27 20:42:43 -07:00
Zhichao Cao
1f6efabe23 Add bottommost_compression_opts to for bottommost_compression (#3985)
Summary:
…ression

 For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985

Reviewed By: riversand963

Differential Revision: D8385911

Pulled By: zhichao-cao

fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
2018-06-27 17:42:38 -07:00
Maysam Yabandeh
235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Manuel Ung
a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Nikhil Benesch
17339dc2f3 Add table property tracking number of range deletions (#4016)
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
2018-06-26 20:27:35 -07:00
Zhongyi Xie
408205a36b use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899)
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899

Differential Revision: D8157459

Pulled By: miasantreble

fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
2018-06-26 15:57:26 -07:00