Commit Graph

3548 Commits

Author SHA1 Message Date
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
PeifengSi
43defe9872 Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902)
Summary:
Even one key falls in a file's range, we can not infer it definitely exists in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902

Differential Revision: D13795018

Pulled By: siying

fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914
2019-01-23 18:00:56 -08:00
Siying Dong
d94aa2f7db Make compaction_pri = kMinOverlappingRatio to be default (#4911)
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911

Differential Revision: D13789262

Pulled By: siying

fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
2019-01-23 16:47:38 -08:00
Siying Dong
5bf941966b CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907)
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907

Differential Revision: D13788774

Pulled By: siying

fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
2019-01-23 13:21:01 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Yi Wu
b1ad6ebba8 WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:

Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.

Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890

Differential Revision: D13705538

Pulled By: maysamyabandeh

fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:24:06 -08:00
Yi Wu
128f532858 WritePrepared: fix issue with snapshot released during compaction (#4858)
Summary:
Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858

Differential Revision: D13617146

Pulled By: maysamyabandeh

fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
2019-01-16 09:55:32 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Yi Wu
5d4fddfa52 WritePrepared: Fix visible key compacted out by compaction (#4883)
Summary:
With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case:
```
seq = 1: put "foo" = "bar"
seq = 2: transaction T: delete "foo", prepare
seq = 3: compaction start
seq = 4: take snapshot S
seq = 5: transaction T: commit.
...
seq = N: compaction iterator reached key "foo".
```
When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out.

The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883

Differential Revision: D13668775

Pulled By: maysamyabandeh

fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
2019-01-15 21:34:38 -08:00
Maysam Yabandeh
cad99a6031 WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886)
Summary:
The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare.
ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max.
To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886

Differential Revision: D13685270

Pulled By: maysamyabandeh

fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
2019-01-15 18:11:52 -08:00
Siying Dong
7d13f307ff Improve Error Message When wal_dir doesn't exist (#4874)
Summary:
Right now the error mesage when options.wal_dir doesn't exist is not helpful to users. Be more specific
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874

Differential Revision: D13642425

Pulled By: siying

fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5
2019-01-15 16:46:04 -08:00
Yanqin Jin
301da345ae Make a copy of MutableCFOptions to avoid race condition (#4876)
Summary:
If we do not do this, then reading MutableCFOptions may have a race condition
with SetOptions which modifies MutableCFOptions.

Also reserve space in advance for vectors to avoid reallocation changing the
address of its elements.

Test plan
```
$make clean && make -j32 all check
$make clean && COMPILE_WITH_TSAN=1 make -j32 all check
$make clean && COMPILE_WITH_ASAN=1 make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876

Differential Revision: D13644500

Pulled By: riversand963

fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47
2019-01-11 17:43:37 -08:00
Maysam Yabandeh
d56ac22b44 Remove duplicates from SnapshotList::GetAll (#4860)
Summary:
The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860

Differential Revision: D13615502

Pulled By: maysamyabandeh

fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b
2019-01-09 16:25:42 -08:00
Siying Dong
8641e9adf7 Non-initial file preloading should always prefetch index and filter (#4852)
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852

Differential Revision: D13595402

Pulled By: siying

fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
2019-01-08 12:47:34 -08:00
tom wang
42135523a0 modify comments about flush_queue_
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4850

Differential Revision: D13591940

Pulled By: sagar0

fbshipit-source-id: 617794e0a41d0f4554d40871180b061e84189fc5
2019-01-07 13:52:59 -08:00
Yi Wu
cf852fdf55 Minor fix: single delete a blob value is not a mismatch (#4848)
Summary:
In compaction iterator, if the next value of single delete is a blob value, it should not treated as mismatch. This is only a minor fix and doesn't affect correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848

Differential Revision: D13585812

Pulled By: yiwu-arbug

fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0
2019-01-04 16:31:02 -08:00
Andrew Kryczka
9e2c804fe6 Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:

- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.

The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829

Differential Revision: D13561355

Pulled By: ajkr

fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
2019-01-04 11:24:08 -08:00
Yanqin Jin
a07175af65 Refactor atomic flush result installation to MANIFEST (#4791)
Summary:
as titled.
Since different bg flush threads can flush different sets of column families
(due to column family creation and drop), we decide not to let one thread
perform atomic flush result installation for other threads. Bg flush threads
will install their atomic flush results sequentially to MANIFEST, using
a conditional variable, i.e. atomic_flush_install_cv_ to coordinate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791

Differential Revision: D13498930

Pulled By: riversand963

fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
2019-01-03 20:56:24 -08:00
Yi Wu
77a8d4d476 Detect if Jemalloc is linked with the binary (#4844)
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844

Differential Revision: D13574934

Pulled By: yiwu-arbug

fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
2019-01-03 16:30:12 -08:00
DorianZheng
8c79f79208 Fix skip WAL for whole write_group when leader's callback fail (#4838)
Summary:
The original implementation has two problems:

1. f0dda35d7d/db/db_impl_write.cc (L478)
f0dda35d7d/db/write_thread.h (L231)

If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss.

2. f0dda35d7d/db/write_thread.h (L130)
The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838

Differential Revision: D13574070

Pulled By: yiwu-arbug

fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b
2019-01-03 12:40:42 -08:00
Siying Dong
e4feb78606 Try to fix DBSSTTest.RateLimitedDelete flakiness (#4840)
Summary:
DBSSTTest.RateLimitedDelete is flakey. The root cause is not completely identified, but
the compaction waiting in the test doesn't strictly wait for compaction cleaning to finish, which
may cause test flakiness. Fix it first and see whether the failures still happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840

Differential Revision: D13567273

Pulled By: siying

fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a
2019-01-03 11:05:19 -08:00
Andrew Kryczka
ace543a815 fix accounting for range tombstones in TableProperties (#4841)
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841

Differential Revision: D13568424

Pulled By: ajkr

fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
2019-01-02 15:08:53 -08:00
Anand Ananthabhotla
b9d6eccac1 Lock free MultiGet (#4754)
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.

Tests:
1. Unit tests
2. make check
3. db_bench -

TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)

Multireadrandom with batch size 1:
multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754

Differential Revision: D13363550

Pulled By: anand1976

fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
2019-01-02 11:42:54 -08:00
Faustin Lammler
7d65bd5ce4 Fix spelling errors (#4827)
Summary:
Hi, Lintian, the Debian package checker complains about spelling error (spelling-error-in-binary).

See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827

Differential Revision: D13566362

Pulled By: riversand963

fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d
2019-01-02 11:17:57 -08:00
Yanqin Jin
ec68091d19 Remove an unused parameter (#4816)
Summary:
The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is
not used. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816

Differential Revision: D13543218

Pulled By: riversand963

fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153
2019-01-02 09:59:13 -08:00
Siying Dong
f0dda35d7d Preload some files even if options.max_open_files (#3340)
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340

Differential Revision: D6686945

Pulled By: siying

fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
2018-12-28 18:02:28 -08:00
Burton Li
46e3209e0d Compaction limiter miscs (#4795)
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() doesn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795

Differential Revision: D13542183

Pulled By: siying

fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
2018-12-26 13:59:35 -08:00
Alexander Zinoviev
80bf8975fd Add a new per level counter for block cache hit (#4796)
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796

Differential Revision: D13513688

Pulled By: zinoale

fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
2018-12-21 13:20:05 -08:00
Andrew Kryczka
e0be1bc4f1 fix DeleteRange memory leak for mmap and block cache (#4810)
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.

Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810

Differential Revision: D13534296

Pulled By: ajkr

fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
2018-12-20 21:59:49 -08:00
Siying Dong
da1c64b6e7 Introduce a CPU time counter in perf_context (#4741)
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741

Differential Revision: D13287798

Pulled By: siying

fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
2018-12-20 12:03:44 -08:00
Abhishek Madan
02bfc5831e Change is_range_del_table_empty_ flag to atomic (#4801)
Summary:
To avoid a race on the flag, make it an atomic_bool. This
doesn't seem to significantly affect benchmarks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4801

Differential Revision: D13523845

Pulled By: abhimadan

fbshipit-source-id: 3bc29f53c50a4e06cd9f8c6232a4bb221868e055
2018-12-19 17:21:14 -08:00
Abhishek Madan
8bf73208a4 Remove stale TODO (#4800)
Summary:
This TODO was already addressed, but I forgot to remove it
before landing the PR it came from.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4800

Differential Revision: D13522284

Pulled By: abhimadan

fbshipit-source-id: 7766bc4f5b54e47d355cf26137ef5e86c604472a
2018-12-19 15:45:37 -08:00
Jakub Tomanik
71a69d9b68 Fix building RocksDB for iOS (#4687)
Summary:
This PR contains the following fixes:

1. Fixing Makefile to support non-default locations of developer tools

2. Fixing compile error using a patch from https://github.com/facebook/rocksdb/pull/4007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4687

Differential Revision: D13287263

Pulled By: riversand963

fbshipit-source-id: 4525eb42ba7b6f82af5f9bfb8e52fa4024e27ccc
2018-12-19 14:13:55 -08:00
Adam Retter
1b0c9ce396 Fix Windows broken build error due to non-const override (#4798)
Summary:
1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`.
The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows.

2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798

Differential Revision: D13519734

Pulled By: sagar0

fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459
2018-12-19 13:29:51 -08:00
Yanqin Jin
671a7eb36f Avoid switching empty memtable in certain cases (#4792)
Summary:
in certain cases, we do not perform memtable switching if the active
memtable of the column family is empty. Two exceptions:
1. In manual flush, if cached_recoverable_state_empty_ is false, then we need
   to switch memtable due to requirement of transaction.
2. In switch WAL, we need to switch memtable anyway because we have to seal the
   memtable if the WAL on which it depends will be closed.

This change can potentially delay the occurence of write stalls because number
of memtables increase more slowly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792

Differential Revision: D13499501

Pulled By: riversand963

fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1
2018-12-18 16:47:23 -08:00
Abhishek Madan
c15df15f07 Fix unused member compile error
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4793

Differential Revision: D13509363

Pulled By: abhimadan

fbshipit-source-id: 530b4765e3335d6ecd016bfaa89645f8aa98c61f
2018-12-18 14:28:42 -08:00
Abhishek Madan
81b6b09f6b Remove v1 RangeDelAggregator (#4778)
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778

Differential Revision: D13495930

Pulled By: abhimadan

fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
2018-12-17 17:33:46 -08:00
Roman Zeyde
a62c6626e0 Support setting options on column families via C bindings (#4785)
Summary:
Currently, it supports setting options only on the default column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4785

Differential Revision: D13491819

Pulled By: ajkr

fbshipit-source-id: 75c78bd86222bb05568e538562af84fb53eb4d8d
2018-12-17 13:52:12 -08:00
Abhishek Madan
abf931afa6 Add compaction logic to RangeDelAggregatorV2 (#4758)
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758

Differential Revision: D13439254

Pulled By: abhimadan

fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
2018-12-17 13:20:51 -08:00
Maysam Yabandeh
4ed3c1eb88 Fix flaky test DeleteFileRange (#4784)
Summary:
The test fails sporadically expecting the DB to be empty after DeleteFilesInRange(..., nullptr, nullptr) call which is not. Debugging shows cases where the files are skipped since they are being compacted. The patch fixes the test by waiting for the last CompactRange to finish before calling DeleteFilesInRange.
Verified by
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=10000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4784

Differential Revision: D13469402

Pulled By: maysamyabandeh

fbshipit-source-id: 3d8f44abe205b82c69f01e7edf27e1f8098248e1
2018-12-14 13:47:36 -08:00
Maysam Yabandeh
349542332a Fix race condition on options_file_number_ (#4780)
Summary:
options_file_number_ must be written under db::mutex_ sine its read is protected by mutex_ in ::GetLiveFiles(). However currently it is written in ::RenameTempFileToOptionsFile() which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by also acquitting the mutex_ before writing options_file_number_. Also it does that only if the rename of option file is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780

Differential Revision: D13461411

Pulled By: maysamyabandeh

fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b
2018-12-13 19:27:38 -08:00
Yanqin Jin
4fce44fc8b Improve flushing multiple column families (#4708)
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708

Differential Revision: D13378954

Pulled By: riversand963

fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
2018-12-13 15:12:40 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Burton Li
a8b9891f95 Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918

We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.

With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.

ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.

The usage is straight forward:
e.g.:

//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);

// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;

// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;

...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332

Differential Revision: D13226590

Pulled By: siying

fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 13:18:28 -08:00
Maysam Yabandeh
0aa17c1002 Fix flaky test DBCompactionTest::DeleteFileRange (#4776)
Summary:
The test has been failing sporadically probably because the configured compaction options were actually unused. Verified that by the following:
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=1000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4776

Differential Revision: D13441052

Pulled By: maysamyabandeh

fbshipit-source-id: d35075b9e6cef9b9c9d0d571f9cd72ade8eda55d
2018-12-12 16:32:14 -08:00
DorianZheng
4862720e08 Expose column family id to FlushJobInfo
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772

Differential Revision: D13428923

Pulled By: ajkr

fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a
2018-12-11 20:33:42 -08:00
Abhishek Madan
cad248f5c6 Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740)
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.

In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740

Differential Revision: D13287382

Pulled By: abhimadan

fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
2018-12-11 12:10:48 -08:00
Sagar Vemuri
dde3ef1116 Change directory where ExternalSSTFileBasicTest runs (#4766)
Summary:
Change the directory where ExternalSSTFileBasicTest* tests run.

**Problem:**
Without this change, I spent considerable time chasing around a non-existent issue as ExternalSSTFileTest.* and ExternalSSTFileBasicTest.* create similar directories.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4766

Differential Revision: D13409384

Pulled By: sagar0

fbshipit-source-id: c33e1f4d505dfa6efbc788d6c57cdb680053ded3
2018-12-11 10:21:37 -08:00
Abhishek Madan
64aabc9183 Properly set smallest key of subcompaction output (#4723)
Summary:
It is possible to see a situation like the following when
subcompactions are enabled:
1. A subcompaction boundary is set to `[b, e)`.
2. The first output file in a subcompaction has `c@20` as its smallest key
3. The range tombstone `[a, d)30` is encountered.
4. The tombstone is written to the range-del meta block and the new
   smallest key is set to `b@0` (since no keys in this subcompaction's
   output can be smaller than `b`).
5. A key `b@10` in a lower level will now reappear, since it is not
   covered by the truncated start key `b@0`.

In general, unless the smallest data key in a file has a seqnum of 0, it
is not safe to truncate a tombstone at the start key to have a seqnum of
0, since it can expose keys with a seqnum greater than 0 but less than
the tombstone's actual seqnum.

To fix this, when the lower bound of a file is from the subcompaction
boundaries, we now set the seqnum of an artificially extended smallest
key to the tombstone's seqnum. This is safe because subcompactions
operate over disjoint sets of keys, and the subcompactions that can
experience this problem are not the first subcompaction (which is
unbounded on the left).

Furthermore, there is now an assertion to detect the described anomalous
case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4723

Differential Revision: D13236188

Pulled By: abhimadan

fbshipit-source-id: a6da6a113f2de1e2ff307ca72e055300c8fe5692
2018-12-10 12:38:31 -08:00
Yanqin Jin
f307479ba6 Enable checkpoint of read-only db (#4681)
Summary:
1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it
   should call DBImpl::GetLiveFiles(flush_memtable=false).
2. In DBImp::Recover, we should also recover the OPTIONS file name and/or
   number so that an immediate subsequent GetLiveFiles will get the correct
   OPTIONS name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681

Differential Revision: D13069205

Pulled By: riversand963

fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a
2018-12-07 17:06:02 -08:00
Yanqin Jin
9be3e6b488 Allow file-ingest-triggered flush to skip waiting for write-stall clear (#4751)
Summary:
When write stall has already been triggered due to number of L0 files reaching
threshold, file ingestion must proceed with its flush without waiting for the
write stall condition to cleared by the compaction because compaction can wait
for ingestion to finish (circular wait).

In order to avoid this wait, we can set `FlushOptions.allow_write_stall` to be
true (default is false). Setting it to false can cause deadlock.

This can happen when the number of compaction threads is low.

Considere the following
```
Time  compaction_thread                        ingestion_thread
 |                                             num_running_ingest_file_++
 |    while(num_running_ingest_file_>0){wait}
 |                                             flush
 V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4751

Differential Revision: D13343037

Pulled By: riversand963

fbshipit-source-id: d3b95938814af46ec4c463feff0b50c70bd8b23f
2018-12-05 14:59:29 -08:00
Yanqin Jin
b96fccb1e6 Move a function to critical section (#4752)
Summary:
Test plan
```
$make clean && make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4752

Differential Revision: D13344705

Pulled By: riversand963

fbshipit-source-id: fc3a43174d09d70ccc2b09decd78e1da1b6ba9d1
2018-12-05 13:12:09 -08:00
Zhongyi Xie
2f1ca4e838 Revert "BaseDeltaIterator: always check valid() before accessing key(… (#4744)
Summary:
…) (#4702)"

This reverts commit 3a18bb3e15.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4744

Differential Revision: D13311869

Pulled By: miasantreble

fbshipit-source-id: 6300b12cc34828d8b9274e907a3aef1506d5d553
2018-12-03 23:38:27 -08:00
Zhongyi Xie
3a18bb3e15 BaseDeltaIterator: always check valid() before accessing key() (#4702)
Summary:
Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors.
This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator
Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702

Differential Revision: D13146643

Pulled By: miasantreble

fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0
2018-11-30 15:35:13 -08:00
Siying Dong
6e938c904f Make NewBloomFilterPolicy() use full filter by default (#4735)
Summary:
Full block (use_block_based_builder=false) Bloom filter has clear CPU saving benefits but with limitation of using temp memory when building an SST file proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction. Now we change to a default with better performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735

Differential Revision: D13266674

Pulled By: siying

fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602
2018-11-30 13:13:27 -08:00
Sagar Vemuri
70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Yanqin Jin
8d7bc76f36 Fix a flaky test DBFlushTest.SyncFail (#4633)
Summary:
There is a race condition in DBFlushTest.SyncFail, as illustrated below.
```
time         thread1                             bg_flush_thread
  |     Flush(wait=false, cfd)
  |     refs_before=cfd->current()->TEST_refs()   PickMemtable calls cfd->current()->Ref()
  V
```
The race condition between thread1 getting the ref count of cfd's current
version and bg_flush_thread incrementing the cfd's current version makes it
possible for later assertion on refs_before to fail. Therefore, we add test
sync points to enforce the order and assert on the ref count before and after
PickMemtable is called in bg_flush_thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633

Differential Revision: D12967131

Pulled By: riversand963

fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b
2018-11-29 13:39:56 -08:00
Kefu Chai
7dbee38716 db/repair: reset Repair::db_lock_ in ctor (#4683)
Summary:
there is chance that

* the caller tries to repair the db when holding the db_lock, in
  that case the env implementation might not set the `lock`
  parameter of Repairer::Run().
* the caller somehow never calls Repairer::Run().

either way, the desctructor of Repair will compare the uninitialized
db_lock_ with nullptr, and tries to unlock it. there is good chance
that the db_lock_ is not nullptr, then boom.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4683

Differential Revision: D13260287

Pulled By: riversand963

fbshipit-source-id: 878a119d2e9f10a0fa17ee62cf3fb24b33d49fa5
2018-11-29 11:26:41 -08:00
Abhishek Madan
8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Zhichao Cao
7125e24619 Add the max trace file size limitation option to Tracing (#4610)
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610

Differential Revision: D12893400

Pulled By: zhichao-cao

fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
2018-11-27 14:27:05 -08:00
Abhishek Madan
85394a96ca Speed up range scans with range tombstones (#4677)
Summary:
Previously, every range tombstone iterator was seeked on every
ShouldDelete call, which quickly degraded performance for long range
scans. This PR improves performance by tracking iterator positions and
only advancing iterators when necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677

Differential Revision: D13205373

Pulled By: abhimadan

fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb
2018-11-26 16:33:41 -08:00
Zhongyi Xie
a21cb22ee3 Revert "apply ReadOptions.iterate_upper_bound to transaction iterator… (#4705)
Summary:
… (#4656)"

This reverts commit b76398a82b.

Will add test coverage for iterate_upper_bound before re-commit b76398
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4705

Differential Revision: D13148592

Pulled By: miasantreble

fbshipit-source-id: 4d1ce0bfd9f7a5359a7688bd780eb06a66f45b1f
2018-11-24 10:46:28 -08:00
Andrew Kryczka
07cf0ee589 Fix ticker stat for number files closed (#4703)
Summary:
We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use...

Blame: 63f216ee0a

Closes #4700.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703

Differential Revision: D13146769

Pulled By: ajkr

fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008
2018-11-21 18:31:34 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Abhishek Madan
457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Abhishek Madan
ed5aec5ba3 Fix range tombstone covering short-circuit logic (#4698)
Summary:
Since a range tombstone seen at one level will cover all keys
in the range at lower levels, there was a short-circuiting check in Get
that reported a key was not found at most one file after the range
tombstone was discovered. However, this was incorrect for merge
operands, since a deletion might only cover some merge operands,
which implies that the key should be found. This PR fixes this logic in
the Version portion of Get, and removes the logic from the MemTable
portion of Get, since the perforamnce benefit provided there is minimal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698

Differential Revision: D13142484

Pulled By: abhimadan

fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
2018-11-20 13:29:22 -08:00
Siying Dong
13579e8c5a WriteBufferManger doens't cost to cache if no limit is set (#4695)
Summary:
WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695

Differential Revision: D13112722

Pulled By: siying

fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
2018-11-18 16:55:43 -08:00
Andrew Kryczka
9d6d4867ab Fix uninitialized fields in file metadata (#4693)
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693

Differential Revision: D13113189

Pulled By: ajkr

fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
2018-11-16 20:49:17 -08:00
Yanqin Jin
147697420a Rollback memtable flush upon atomic flush fail (#4641)
Summary:
This fixes an assertion.

An atomic flush can have multiple flush jobs. Some of them may fail. If any of
them fails, we need to rollback all of them.
For the flush jobs that do fail, we already call `RollbackMemTableFlush` in
`FlushJob::Run`. The tricky part is for flush jobs that have completed
successfully. We need to call `RollbackMemTableFlush` for them as well.

The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort
without the corresponding change in AtomicFlushMemTablesToOutputFiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641

Differential Revision: D12943649

Pulled By: riversand963

fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868
2018-11-14 20:54:17 -08:00
Abhishek Madan
6bee36a786 Modify FragmentedRangeTombstoneList member layout (#4632)
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:

```
tombstones_:     [a, b) [c, e) [h, k)
                   | \   /  \   /  |
                   |  \ /    \ /   |
                   v   v      v    v
tombstone_seqs_: [ 5 3 10 7 2 8 6  ]
```

This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632

Differential Revision: D13053103

Pulled By: abhimadan

fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
2018-11-14 17:52:17 -08:00
Siying Dong
f5c8cf5fed Increase wait time in DBTest.SanitizeNumThreads (#4659)
Summary:
DBTest.SanitizeNumThreads Sometimes fails. The test waited for 10ms timeout and expect all threads scheduled to be executed. This can be a source of flakiness. Make a check every 1ms and up to 10s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4659

Differential Revision: D13074174

Pulled By: siying

fbshipit-source-id: b1d5ff87a326a4fc9eab8d1cc307bbb940dfe70c
2018-11-14 16:19:36 -08:00
Zhongyi Xie
d8df169b84 release db mutex when calling ApproximateSize (#4630)
Summary:
`GenSubcompactionBoundaries` calls `VersionSet::ApproximateSize` which gets BlockBasedTableReader for every file and seeks in its index block to find `key`'s offset. If the table or index block aren't in memory already, this involves I/O. This can be improved by releasing DB mutex when calling ApproximateSize.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4630

Differential Revision: D13052653

Pulled By: miasantreble

fbshipit-source-id: cae31d46d10d0860fa8a26b8d5154b2d17d1685f
2018-11-13 17:08:34 -08:00
Zhongyi Xie
b76398a82b apply ReadOptions.iterate_upper_bound to transaction iterator (#4656)
Summary:
Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656

Differential Revision: D13039257

Pulled By: miasantreble

fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1
2018-11-13 15:44:15 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
Siying Dong
abb1a8fc23 Add a unit test to assert number of preads (#4657)
Summary:
We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simply unit test to make sure that when reading a data block, we only issue one pread against the SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657

Differential Revision: D13005260

Pulled By: siying

fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c
2018-11-13 12:52:19 -08:00
QingpingWang
4f0fcb78ae Expose num entries and deletions of sst files (#4623)
Summary:
he ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level.
Also refer #3980
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623

Differential Revision: D13045744

Pulled By: sagar0

fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3
2018-11-13 11:52:19 -08:00
Soli Como
5945e16dfc Divide NO_ITERATORS into two counters NO_ITERATOR_CREATED and NO_ITERATOR_DELETE (#4498)
Summary:
Currently, `Statistics` can record tick by `recordTick()` whose second parameter is an `uint64_t`.
That means tick can only increase.
If we want to reduce tick, we have to work around like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
That's kind of a hack.

So, this PR divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only.

Fixes #3013 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498

Differential Revision: D10395010

Pulled By: sagar0

fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
2018-11-13 11:46:32 -08:00
Soli
a478682260 Fix #3840: only SyncClosedLogs for multiple CFs (#4460)
Summary:
Call `SyncClosedLogs()` only if there are more than one column families.

Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly.

See #3840 for more info.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460

Differential Revision: D12896377

Pulled By: riversand963

fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf
2018-11-13 11:32:16 -08:00
Andrew Kryczka
ea9454700a Backup engine support for direct I/O reads (#4640)
Summary:
Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640

Differential Revision: D13015268

Pulled By: ajkr

fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
2018-11-13 11:17:25 -08:00
Zhongyi Xie
b313019326 use per-level perfcontext for DB::Get calls (#4617)
Summary:
this PR adds two more per-level perf context counters to track
* number of keys returned in Get call, break down by levels
* total processing time at each level during Get call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617

Differential Revision: D12898024

Pulled By: miasantreble

fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
2018-11-13 10:40:49 -08:00
Sagar Vemuri
2993cd2002 Fix RocksDB Lite build (#4675)
Summary:
Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SSTFileWriter` and `IngestExternalFile`, but these is not exposed under lite mode. Fixed by #ifdef'ing out the test.

```
db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’:
db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’
   rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options};
   ^
In file included from ./util/testharness.h:15:0,
                 from ./table/mock_table.h:23,
                 from ./db/db_test_util.h:44,
                 from db/db_test2.cc:13:
db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope
   ASSERT_OK(sst_file_writer.Open(external_file1));
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675

Differential Revision: D13035984

Pulled By: sagar0

fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6
2018-11-12 19:01:37 -08:00
Abhishek Madan
7d04ef4655 Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668)
Summary:
Part of the test required that a compaction start before a
manual flush, but this was not enforced by the test. In some cases,
particularly when writing to tmpfs, this could lead to the compaction
starting after the flush, which caused the base level to be higher than
it was expected to be. Add a sync point in the test to ensure that the
flush and compaction happen simultaneously.

The test also had some stale comments, so those have been removed or
modified, and the test has been simplified so that it no longer uses sleeps
and writes uncompressed SSTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4668

Differential Revision: D13032440

Pulled By: abhimadan

fbshipit-source-id: 3f23b583a096454dafb8d8ea75678605dec80209
2018-11-12 16:42:16 -08:00
DorianZheng
0f88160f67 Fix CompactFiles bug (#4665)
Summary:
`CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names`

The timeline of execution flow is as follow:

Let's say that level N has two file [1,2] and [5,6]
```
timeline              user_thread1                             user_thread2
t0   |      CompactFiles([1, 2], [5, 6]) begin
t1   |         GetReferencedSuperVersion()
t2   |                                              IngestExternalFile([3,4]) to level N begin
t3   |             CompactFiles resume
     V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665

Differential Revision: D13030674

Pulled By: ajkr

fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51
2018-11-12 14:32:18 -08:00
Yanqin Jin
05dec0c7c7 Remove redundant member var and set options (#4631)
Summary:
In the past, both `DBImpl::atomic_flush_` and
`DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set
`immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is
set correctly. This does not lead to incorrect behavior, but is a duplicate of
information.

Since `immutable_db_options_` is always there and has `atomic_flush`, we should
use it as source of truth and remove `DBImpl::atomic_flush_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631

Differential Revision: D12928371

Pulled By: riversand963

fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
2018-11-12 12:24:26 -08:00
DorianZheng
09426ae1c7 Fix DBImpl::GetColumnFamilyHandleUnlocked data race (#4666)
Summary:
Hi, yiwu-arbug, I found that `DBImpl::GetColumnFamilyHandleUnlocked` still have data race condition, because `column_family_memtables_` has a stateful cache `current_` and `column_family_memtables_::Seek` maybe call without the protection of `mutex_` by a write thread

check 859dbda6e3/db/write_batch.cc (L1188)  and   859dbda6e3/db/write_batch.cc (L1756)  and  859dbda6e3/db/db_impl_write.cc (L318)

So it's better to use `versions_->GetColumnFamilySet()->GetColumnFamily` instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4666

Differential Revision: D13027117

Pulled By: yiwu-arbug

fbshipit-source-id: 4e3778eaf8e7f7c8577bbd78129b6a5fd7ce79fb
2018-11-12 11:52:34 -08:00
Yi Wu
859dbda6e3 Fix DBTest.SoftLimit flakyness (#4658)
Summary:
The flakyness can be reproduced with the following patch:
```
 --- a/db/db_impl_compaction_flush.cc
+++ b/db/db_impl_compaction_flush.cc
@@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() {
       if (job_context.HaveSomethingToDelete()) {
         PurgeObsoleteFiles(job_context);
       }
+      static int f_count = 0;
+      printf("clean flush job context %d\n", ++f_count);
+      env_->SleepForMicroseconds(1000000);
       job_context.Clean();
       mutex_.Lock();
     }
```
The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` being called. The event listener is triggered on `JobContext::Clean`, which happens after flush result is installed. At the time we check for stall condition after flushing memtable, the job context cleanup may not be finished.

To fix the flaykyness, we use sync point to create a custom WaitForFlush that waits for context cleanup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658

Differential Revision: D13007301

Pulled By: yiwu-arbug

fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712
2018-11-09 16:45:19 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Zhongyi Xie
fce5994603 Add more sync point to fix flaky test GroupCommitTest
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4637

Differential Revision: D12963727

Pulled By: miasantreble

fbshipit-source-id: 76053501afbecc6ef388ddc56542fa0185243e3f
2018-11-07 14:07:53 -08:00
Siying Dong
566fc8b994 Black list some valgrind tests (#4642)
Summary:
valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642

Differential Revision: D12945237

Pulled By: siying

fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d
2018-11-06 14:22:36 -08:00
Andrew Kryczka
fffac43cfb Add DB property for SST files kept from deletion (#4618)
Summary:
This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618

Differential Revision: D12898179

Pulled By: ajkr

fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
2018-11-05 20:24:40 -08:00
Siying Dong
c3105aa50d Try to fix ExternalSSTFileTest.IngestNonExistingFile flakines (#4625)
Summary:
ExternalSSTFileTest.IngestNonExistingFile occasionally fail for number of SST files after manual compaction doesn't go down as expected. Although I don't find a reason how this can happen, adding an extra waiting to make sure obsolete file purging has finished before we check the files doesn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4625

Differential Revision: D12910586

Pulled By: siying

fbshipit-source-id: 2a5ddec6908c99cf3bcc78431c6f93151c2cab59
2018-11-02 17:26:35 -07:00
Zhongyi Xie
61311157ff exclude get db property calls from rocksdb_lite (#4619)
Summary:
fix current failing lite test:
> In file included from ./util/testharness.h:15:0,
                 from ./table/mock_table.h:23,
                 from ./db/db_test_util.h:44,
                 from db/db_flush_test.cc:10:
db/db_flush_test.cc: In member function ‘virtual void rocksdb::DBFlushTest_ManualFlushFailsInReadOnlyMode_Test::TestBody()’:
db/db_flush_test.cc:250:35: error: ‘Properties’ is not a member of ‘rocksdb::DB’
   ASSERT_TRUE(db_->GetIntProperty(DB::Properties::kBackgroundErrors,
                                   ^
make: *** [db/db_flush_test.o] Error 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4619

Differential Revision: D12898319

Pulled By: miasantreble

fbshipit-source-id: 72de603b1f2e972fc8caa88611798c4e98e348c6
2018-11-02 11:28:59 -07:00
Yanqin Jin
de18a2d82e Update test to cover a new case in file ingestion (#4614)
Summary:
The new case is directIO = true, write_global_seqno = false in which we no longer write global_seqno to the external SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4614

Differential Revision: D12885001

Pulled By: riversand963

fbshipit-source-id: 7541bdc608b3a0c93d3c3c435da1b162b36673d4
2018-11-01 16:23:49 -07:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Andrew Kryczka
5c794d94c4 Prevent manual flush hanging in read-only mode (#4615)
Summary:
The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615

Differential Revision: D12888008

Pulled By: ajkr

fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c
2018-11-01 15:27:06 -07:00
Andrew Kryczka
b8f68bac38 Prevent manual compaction hanging in read-only mode (#4611)
Summary:
A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611

Differential Revision: D12871217

Pulled By: ajkr

fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e
2018-10-31 17:24:36 -07:00
Yanqin Jin
d1118f6f19 Add test to check if DB can handle atomic group (#4433)
Summary:
Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has valid atomic group, incomplete trailing atomic group, atomic group mixed with normal version edits and atomic group with incorrect size.
With this capability, RocksDB identifies non-valid groups of version edits and do not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433

Differential Revision: D10079202

Pulled By: riversand963

fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876
2018-10-30 16:37:47 -07:00
Abhishek Madan
eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Siying Dong
9da88a8321 Remove info logging in db mutex inside EnableFileDeletions() (#4604)
Summary:
EnableFileDeletions() does info logging inside db mutex. This is not recommended in the code base, since there could be I/O involved. Move this outside the DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4604

Differential Revision: D12834432

Pulled By: siying

fbshipit-source-id: ffe5c2626fcfdb4c54a661a3c3b0bc95054816cf
2018-10-30 10:33:59 -07:00
Andrew Kryczka
cae540ebef Fix range tombstones written to more files than necessary (#4592)
Summary:
When there's a gap between files, we do not need to output tombstones starting at the next output file's begin key to the current output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4592

Differential Revision: D12808627

Pulled By: ajkr

fbshipit-source-id: 77c8b2e7523a95b1cd6611194144092c06acb505
2018-10-29 19:23:27 -07:00
Yanqin Jin
806ff34b61 Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596)
Summary:
Since ErrorHandler::RecoverFromNoSpace is no-op in LITE mode, then we should
not have this test in LITE mode. If we do keep it, it will cause the test
thread to wait on bg_cv_ that will not be signalled.

How to reproduce
```
$make clean && git checkout a27fce408e
$OPT="-DROCKSDB_LITE -g" make -j20
$./db_io_failure_test --gtest_filter=DBIOFailureTest.NoSpaceCompactRange
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4596

Differential Revision: D12818516

Pulled By: riversand963

fbshipit-source-id: bc83524f40fff1e29506979017f7f4c2b70322f3
2018-10-29 14:36:31 -07:00
Yanqin Jin
92b4401566 Avoid memtable cut when active memtable is empty (#4595)
Summary:
For flush triggered by RocksDB due to memory usage approaching certain
threshold (WriteBufferManager or Memtable full), we should cut the memtable
only when the current active memtable is not empty, i.e. contains data. This is
what we do for non-atomic flush. If we always cut memtable even when the active
memtable is empty, we will generate extra, empty immutable memtable.
This is not ideal since it may cause write stall. It also causes some
DBAtomicFlushTest to fail because cfd->imm()->NumNotFlushed() is different from
expectation.

Test plan
```
$make clean && make J=1 -j32 all check
$make clean && OPT="-DROCKSDB_LITE -g" make J=1 -j32 all check
$make clean && TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make J=1 -j32 valgrind_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4595

Differential Revision: D12818520

Pulled By: riversand963

fbshipit-source-id: d867bdbeacf4199fdd642debb085f94703c41a18
2018-10-29 09:45:32 -07:00
Yanqin Jin
5b4c709fad Enable atomic flush (#4023)
Summary:
Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023

Differential Revision: D8518381

Pulled By: riversand963

fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f
2018-10-26 15:08:43 -07:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Abhishek Madan
7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Zhongyi Xie
fe0d23059d Fix two contrun job failures (#4587)
Summary:
Currently there are two contrun test failures:
* rocksdb-contrun-lite:
> tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’:
tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’
     rocksdb::DumpMallocStats(&stats_string);
     ^
make: *** [tools/db_bench_tool.o] Error 1
* rocksdb-contrun-unity:
> In file included from unity.cc:44:0:
db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’:
db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous
   auto cmp = ParsedInternalKeyComparator(icmp_);

This PR will fix them
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587

Differential Revision: D10846554

Pulled By: miasantreble

fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93
2018-10-24 20:16:45 -07:00
Yanqin Jin
eb8c9918f7 Remove unused variable
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4585

Differential Revision: D10841983

Pulled By: riversand963

fbshipit-source-id: 6a7e0b40065bcfbb10a2cac0cec1e8da0750a617
2018-10-24 15:51:45 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Zhongyi Xie
21bf7421ca use per-level perf context for bloom filter related counters (#4581)
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581

Differential Revision: D10518010

Pulled By: miasantreble

fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
2018-10-24 12:21:38 -07:00
Neil Mayhew
43dbd4411e Adapt three unit tests with newer compiler/libraries (#4562)
Summary:
This fixes three tests that fail with relatively recent tools and libraries:

The tests are:

* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`

I'm using:

* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1

The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.

However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:

```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```

I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562

Differential Revision: D10510917

Pulled By: maysamyabandeh

fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
2018-10-24 08:17:56 -07:00
Zhongyi Xie
f6b151f16d fix clang analyzer error (#4583)
Summary:
clang analyzer currently fails with the following warnings:
> db/log_reader.cc:323:9: warning: Undefined or garbage value returned to caller
        return r;
        ^~~~~~~~
db/log_reader.cc:344:11: warning: Undefined or garbage value returned to caller
          return r;
          ^~~~~~~~
db/log_reader.cc:369:11: warning: Undefined or garbage value returned to caller
          return r;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4583

Differential Revision: D10523517

Pulled By: miasantreble

fbshipit-source-id: 0cc8b8f27657b202bead148bbe7c4aa84fed095b
2018-10-23 22:14:54 -07:00
Maysam Yabandeh
c34cc40424 Fix user comparator receiving internal key (#4575)
Summary:
There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575

Differential Revision: D10500434

Pulled By: maysamyabandeh

fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
2018-10-23 08:14:46 -07:00
Siying Dong
7024263682 Dynamic level to adjust level multiplier when write is too heavy (#4338)
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.

We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338

Differential Revision: D9636782

Pulled By: siying

fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
2018-10-22 10:21:47 -07:00
Yi Wu
933250e355 Fix RepeatableThreadTest::MockEnvTest hang (#4560)
Summary:
When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560

Differential Revision: D10472851

Pulled By: yiwu-arbug

fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
2018-10-21 20:17:18 -07:00
Yanqin Jin
da4aa59b4c Add read retry support to log reader (#4394)
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.

Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394

Differential Revision: D9926508

Pulled By: riversand963

fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
2018-10-19 11:53:00 -07:00
Maysam Yabandeh
0afa5b53d7 Disable GroupCommitTest in Appveyor (#4536)
Summary:
We have already disabled it on Travis since it has been too flaky. The same problem arises in Appveyor as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4536

Differential Revision: D10452240

Pulled By: maysamyabandeh

fbshipit-source-id: 728f4ecddf780097159dc0a0737d460eb5ce4f09
2018-10-18 14:21:09 -07:00
Abhishek Madan
45f213b558 Lazily initialize RangeDelAggregator stripe map entries (#4497)
Summary:
When there are no range deletions, flush and compaction perform a binary search
on an effectively empty map every time they call ShouldDelete. This PR lazily
initializes each stripe map entry so that the binary search can be elided in
these cases.

After this PR, the total amount of time spent in compactions is 52.541331s, and the total amount of time spent in flush is 5.532608s, the former of which is a significant improvement from the results after #4495.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4497

Differential Revision: D10428610

Pulled By: abhimadan

fbshipit-source-id: 6f7e1ce3698fac3ef86d1197955e6b72e0931a0f
2018-10-17 11:47:34 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
anand1976
1e3845805d Properly determine a truncated CompactRange stop key (#4496)
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.

Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496

Differential Revision: D10395026

Pulled By: anand1976

fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
2018-10-15 23:22:51 -07:00
Yanqin Jin
e633983cf1 Add support to flush multiple CFs atomically (#4262)
Summary:
Leverage existing `FlushJob` to implement atomic flush of multiple column families.

This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262

Differential Revision: D9283109

Pulled By: riversand963

fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
2018-10-15 20:01:17 -07:00
Andrew Kryczka
32b4d4ad47 Avoid per-key linear scan over snapshots in compaction (#4495)
Summary:
`CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495

Differential Revision: D10386470

Pulled By: ajkr

fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b
2018-10-15 16:21:22 -07:00
Yanqin Jin
729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
Yi Wu
6f8d4bdff1 Fix compile error with jemalloc (#4488)
Summary:
The "je_" prefix of jemalloc APIs presents only when the macro `JEMALLOC_NO_RENAME` from jemalloc.h presents.

With the patch I'm also adding -DROCKSDB_JEMALLOC flag in buck TARGETS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4488

Differential Revision: D10355971

Pulled By: yiwu-arbug

fbshipit-source-id: 03a2d69790a44ac89219c7525763fa937a63d95a
2018-10-12 11:50:50 -07:00
Chinmay Kamat
6422356a27 Acquire lock on DB LOCK file before starting repair. (#4435)
Summary:
This commit adds code to acquire lock on the DB LOCK file
before starting the repair process. This will prevent
multiple processes from performing repair on the same DB
simultaneously. Fixes repair_test to work with this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4435

Differential Revision: D10361499

Pulled By: riversand963

fbshipit-source-id: 3c512c48b7193d383b2279ccecabdb660ac1cf22
2018-10-12 10:41:54 -07:00
Abhishek Madan
7dd1641048 Use vector in UncollapsedRangeDelMap (#4487)
Summary:
Using `./range_del_aggregator_bench --use_collapsed=false
--num_range_tombstones=5000 --num_runs=1000`, here are the results before and
after this change:

Before:
```
=========================
Results:
=========================
AddTombstones:           1822.61 us
ShouldDelete (first):    94.5286 us
```

After:
```
=========================
Results:
=========================
AddTombstones:           199.26 us
ShouldDelete (first):    38.9344 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4487

Differential Revision: D10347288

Pulled By: abhimadan

fbshipit-source-id: d44efe3a166d583acfdc3ec1199e0892f34dbfb7
2018-10-11 15:29:14 -07:00
UncP
531786ebf7 DBWriteImpl: remove redundant code (#4450)
Summary:
in `WriteThread::LaunchParallelMemTableWriters`, there is `  write_group->running.store(write_group->size);
`
https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450

Differential Revision: D10201900

Pulled By: yiwu-arbug

fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b
2018-10-10 21:00:32 -07:00
Simon Grätzer
ceded4535d WriteBatch::Iterate wrongly returns Status::Corruption (#4478)
Summary:
Wrong I overwrite `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error.
I don't think this check is used correctly here: The counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478

Differential Revision: D10317416

Pulled By: yiwu-arbug

fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61
2018-10-10 20:57:27 -07:00
Andrew Kryczka
7e56072290 Fix merge operand reappearing when covered by DeleteRange (#4481)
Summary:
Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481

Differential Revision: D10319507

Pulled By: ajkr

fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
2018-10-10 18:16:12 -07:00
Peter Pei
09814f2cfc support OnCompactionBegin (#4431)
Summary:
fix #4288

Add `OnCompactionBegin` support to `rocksdb::EventListener`.

Currently, we only have these three callbacks:

- OnFlushBegin
- OnFlushCompleted
- OnCompactionCompleted

As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`.

This PR is a try to implement the support of `OnCompactionBegin`.

Hope it is useful to you.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431

Differential Revision: D10055515

Pulled By: yiwu-arbug

fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
2018-10-10 17:32:27 -07:00
Andrew Kryczka
faa70fc575 DeleteRange regression tests using public API (#4476)
Summary:
I wrote a couple tests using the public API to expose/prevent the bugs we talked. In particular,

- When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key.
- When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476

Differential Revision: D10286001

Pulled By: ajkr

fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa
2018-10-10 12:30:11 -07:00
Abhishek Madan
9c6fea7fe1 Update HISTORY.md, fix unity_test failure (#4479)
Summary:
Follow-up to https://github.com/facebook/rocksdb/pull/4432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479

Differential Revision: D10304151

Pulled By: abhimadan

fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7
2018-10-10 12:09:56 -07:00
Anand Ananthabhotla
854a4be03f Handle mixed slowdown/no_slowdown writer properly (#4475)
Summary:
There is a bug when the write queue leader is blocked on a write
delay/stop, and the queue has writers with WriteOptions::no_slowdown set
to true. They are not woken up until the write stall is cleared.

The fix introduces a dummy writer inserted at the tail to indicate a
write stall and prevent further inserts into the queue, and a condition
variable that writers who can tolerate slowdown wait on before adding
themselves to the queue. The leader calls WriteThread::BeginWriteStall()
to add the dummy writer and then walk the queue to fail any writers with
no_slowdown set. Once the stall clears, the leader calls
WriteThread::EndWriteStall() to remove the dummy writer and signal the
condition variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475

Differential Revision: D10285827

Pulled By: anand1976

fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
2018-10-09 22:52:40 -07:00
jsteemann
141ef7f8d3 avoid copying when iterating using range-based for (#4459)
Summary:
this avoids a few copies of std::string and other structs
in the context of range-based for loops. instead of copying
the values for each iteration, use a const reference to avoid
copying.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459

Differential Revision: D10282045

Pulled By: sagar0

fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb
2018-10-09 17:15:51 -07:00
jsteemann
517d3b8b77 fix typo in error message, twice (#4457)
Summary:
Fixes a typo in error messages returned by Iterator::GetProperty(...)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457

Differential Revision: D10281965

Pulled By: sagar0

fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22
2018-10-09 17:07:27 -07:00
Abhishek Madan
3a4bd36fed Truncate range tombstones by leveraging InternalKeys (#4432)
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlaping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundariies
instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683
for motivating examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432

Differential Revision: D10263952

Pulled By: abhimadan

fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
2018-10-09 15:19:38 -07:00
Zhongyi Xie
283a700f5d add locking around calls to RecalculateWriteStallConditions in column_family_test (#4474)
Summary:
this should fix the current failing TSAN jobs:
The callstack for TSAN:
> WARNING: ThreadSanitizer: data race (pid=87440)
  Read of size 8 at 0x7d580000fce0 by thread T22 (mutexes: write M548703):
    #0 rocksdb::InternalStats::DumpCFStatsNoFileHistogram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:1204 (column_family_test+0x00000080eca7)
    #1 rocksdb::InternalStats::DumpCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:1169 (column_family_test+0x0000008106d0)
    #2 rocksdb::InternalStats::HandleCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Slice) db/internal_stats.cc:578 (column_family_test+0x000000810720)
    #3 rocksdb::InternalStats::GetStringProperty(rocksdb::DBPropertyInfo const&, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:488 (column_family_test+0x00000080670c)
    #4 rocksdb::DBImpl::DumpStats() db/db_impl.cc:625 (column_family_test+0x00000070ce9a)

>  Previous write of size 8 at 0x7d580000fce0 by main thread:
    #0 rocksdb::InternalStats::AddCFStats(rocksdb::InternalStats::InternalCFStatsType, unsigned long) db/internal_stats.h:324 (column_family_test+0x000000693bbf)
    #1 rocksdb::ColumnFamilyData::RecalculateWriteStallConditions(rocksdb::MutableCFOptions const&) db/column_family.cc:818 (column_family_test+0x000000693bbf)
    #2 rocksdb::ColumnFamilyTest_WriteStallSingleColumnFamily_Test::TestBody() db/column_family_test.cc:2563 (column_family_test+0x0000005e5a49)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4474

Differential Revision: D10262099

Pulled By: miasantreble

fbshipit-source-id: 1247973a3ca32e399b4575d3401dd5439c39efc5
2018-10-09 14:10:13 -07:00
Zhongyi Xie
cac87fcf57 move dump stats to a separate thread (#4382)
Summary:
Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all.
We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior.

Tested with db_bench using `--stats_dump_period_sec=20` in command line:
> LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------

LOG content:
> 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
** DB Stats **
Uptime(secs): 20.0 total, 20.0 interval
Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
Interval stall: 00:00:0.012 H:M:S, 0.1 percent
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382

Differential Revision: D9933051

Pulled By: miasantreble

fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
2018-10-08 22:54:43 -07:00
DorianZheng
27090ae8f6 Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391)
Summary:
- Fix DBImpl API race condition

The timeline of execution flow is as follow:
```
timeline              user_thread1                      user_thread2
t1   |     cfh = GetColumnFamilyHandleUnlocked(0)
t2   |     id1 = cfh->GetID()
t3   |                                                GetColumnFamilyHandleUnlocked(1)
t4   |     id2 = cfh->GetID()
     V
```
The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id`

- Expose ColumnFamily ID to compaction event listener

- Fix the return status of `DBImpl::GetLatestSequenceForKey`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391

Differential Revision: D10221243

Pulled By: yiwu-arbug

fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993
2018-10-08 14:24:16 -07:00
DorianZheng
e0f05754ba Expose column family id to OnCompactionCompleted (#4466)
Summary:
The controller you requested could not be found. PTAL
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466

Differential Revision: D10241358

Pulled By: yiwu-arbug

fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e
2018-10-08 14:24:16 -07:00
DorianZheng
7487a7628c Fix return status of DBImpl::GetLatestSequenceForKey
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4467

Differential Revision: D10241418

Pulled By: yiwu-arbug

fbshipit-source-id: f6adbe7292b2c934e14971c7432b3eb115c35026
2018-10-08 14:22:05 -07:00
Maysam Yabandeh
21b51dfec4 Add inline comments to flush job (#4464)
Summary:
It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464

Differential Revision: D10224918

Pulled By: maysamyabandeh

fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe
2018-10-05 15:41:17 -07:00
Maysam Yabandeh
1fb6805527 Fix snprintf buffer overflow bug (#4465)
Summary:
The contract of snprintf says that it returns "The number of characters that would have been written if n had been sufficiently large" http://www.cplusplus.com/reference/cstdio/snprintf/
The existing code however was assuming that the return value is the actual number of written bytes and uses that to reposition the starting point on the next call to snprintf. This leads to buffer overflow when the last call to snprintf has filled up the buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4465

Differential Revision: D10224080

Pulled By: maysamyabandeh

fbshipit-source-id: 40f44e122d15b0db439812a0a361167cf012de3e
2018-10-05 14:50:51 -07:00
Dmitry Alimov
e13d8dcbbb Fix typos in comments (#4456)
Summary:
Fix some typos in the comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456

Differential Revision: D10209214

Pulled By: miasantreble

fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb
2018-10-04 20:46:50 -07:00
Zhongyi Xie
ce1fc5af09 fix unused param allocator in compression.h (#4453)
Summary:
this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
JiYou
a1f6142f38 VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385)
Summary:
This fix is for `level == 0` in `GetOverlappingInputs()`:
- In `GetOverlappingInputs()`, if `level == 0`, it has potential
risk of overflow if `i == 0`.
- Optmize process when `expand = true`, the expected complexity
can be reduced to O(n).

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385

Differential Revision: D10181001

Pulled By: riversand963

fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5
2018-10-03 18:40:59 -07:00
Yanqin Jin
4e58b2ea3d Check for compression lib support before test exec (#4443)
Summary:
Before running CompactFilesTest.SentinelCompressionType, we should check
whether zlib and snappy are supported.

CompactFilesTest.SentinelCompressionType is a newly added test. Compilation and
linking with different options, e.g. COMPILE_WITH_TSAN, COMPILE_WITH_ASAN, etc.
lead to generation of different binaries. On the one hand, it's not clear why
zlib or snappy is present under ASAN, but not under TSAN. On the other hand,
changing the compilation flags for TSAN or ASAN seems a bigger change worth much
more attention. To unblock the cont-runs, I suggest that we simply add these
two checks at the beginning of the test, as we did for
GeneralTableTest.ApproximateOffsetOfCompressed in table/table_test.cc.

Future actions include invesigating the absence of zlib and snappy when
compiling with TSAN, i.e. COMPILE_WITH_TSAN=1, if necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4443

Differential Revision: D10140935

Pulled By: riversand963

fbshipit-source-id: 62f96d1e685386accd2ef0b98f6f754d3fd67b3e
2018-10-02 10:42:01 -07:00
Yanqin Jin
be5cc4c7b8 Remove a race condition between lsdir and rm (#4440)
Summary:
In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is
a race condition between `DBTestBase::GetSstFileCount` and
`DBImpl::PurgeObsoleteFiles`. The following graph explains why.

```
Timeline  db_compact_test_t              bg_flush_t         bg_compact_t
    |  [initiate bg flush and
    |      start waiting]
    |                                     flush
    |                                     DeleteObsoleteFiles
    |  [waken up by bg_flush_t which
    |   signaled in DeleteObsoleteFiles]
    |
    |  [initiate compaction and
    |   start waiting]
    |
    |                                                         [compact,
    |                                                          set manual.done to true]
    |                                   [signal at the end of
    |                                    BackgroundCallFlush]
    |
    |  [waken up by bg_flush_t
    |   which signaled before
    |   returning from
    |   BackgroundCallFlush]
    |
    |  Check manual.done is true
    |
    |  GetSstFileCount    <-- race condition -->           PurgeObsoleteFiles
    V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440

Differential Revision: D10122628

Pulled By: riversand963

fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f
2018-10-01 11:57:55 -07:00
Andrew Kryczka
ac6f435a9a Fix CompactFiles support for kDisableCompressionOption (#4438)
Summary:
Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438

Differential Revision: D10115761

Pulled By: ajkr

fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122
2018-10-01 01:18:10 -07:00
JiYou
75ca13875c FindFile: use std::lower_bound reduce the repeated code. (#4372)
Summary:
`FindFile()` and  `FindFileInRange()` actually works as the same
of `std::lower_bound()`. Use `std::lower_bound()` to reduce the
repeated code.

- change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()`

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372

Differential Revision: D9919677

Pulled By: ajkr

fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de
2018-09-27 10:35:00 -07:00
Yi Wu
dc813e4b85 Improve log handling when recover without flush (#4405)
Summary:
Improve log handling when avoid_flush_during_recovery=true.
1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253.
2. truncate the last existing log, since this log can contain preallocated space and it will be a waste to keep the space. It avoids a crash loop of user application cause a lot of log with non-trivial size being created and ultimately take up all disk space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405

Differential Revision: D9953933

Pulled By: yiwu-arbug

fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56
2018-09-26 10:37:48 -07:00