Commit Graph

3475 Commits

Author SHA1 Message Date
Siying Dong
49ddd7ec4f Stats should be logged in INFO level (#4977)
Summary:
Previously, stats were logged in warning level. This was done in that way because
people reported that it wasn't logged in MyRocks. However, later we learned that it turns
out to be due to a bug in MyRocks, which is fixed in
79bb705e74

Now we revert the stats logging to INFO level, so that it doesn't pollute the warning
level logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977

Differential Revision: D14058485

Pulled By: siying

fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6
2019-02-12 16:54:55 -08:00
Yanqin Jin
c5a64cffd2 Avoid fsync on the same directory in atomic flush (#4817)
Summary:
In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once
on the same data directory. If two column families share a common directory for
their data, we call fsync only once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817

Differential Revision: D13543689

Pulled By: riversand963

fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5
2019-02-12 12:28:36 -08:00
Andrew Kryczka
62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
Maysam Yabandeh
576d2d6c60 WritePrepared: relax assert in compaction iterator (#4969)
Summary:
If IsInSnapshot(seq2, snapshot) determines that the snapshot is released, the future queries IsInSnapshot(seq1, snapshot) could still return a definitive answer of true if for example seq1 is too old that is determined visible in all snapshots. This violates a recently added assert statement to compaction iterator. The patch relaxes the assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4969

Differential Revision: D14030998

Pulled By: maysamyabandeh

fbshipit-source-id: 6db53db0e37d0a20e8997ef2c1004b8627614ab9
2019-02-11 15:01:46 -08:00
Yanqin Jin
2d049ab7e8 Checksum properties block for block-based table (#4956)
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956

Differential Revision: D14012741

Pulled By: riversand963

fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
2019-02-11 11:50:01 -08:00
Siying Dong
5d9a623e2c Add a unit test to Ignorable manfiest record (#4964)
Summary:
https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manfiest
record. Adding a test to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964

Differential Revision: D14012667

Pulled By: siying

fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef
2019-02-11 11:20:24 -08:00
tang-jianfeng
08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Siying Dong
1a761e6a6c Add a placeholder in manifest indicating ignorable record (#4960)
Summary:
We want to reserve some right that some extra information added manifest
in the future can be forward compatible by previous versions. Now we create a
place holder for that. A bit in tag is added to indicate that a field can be
safely ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960

Differential Revision: D14000484

Pulled By: siying

fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6
2019-02-08 11:33:11 -08:00
Siying Dong
f48758e939 Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954)
Summary:
We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
what we have expected. We thought that snapshot will always be preserved.
However, we just realized that, if no snapshot is created while compaction
starts, and a snapshot is created after that, the data seen from the snapshot
can successfully be dropped by the compaction. This creates a strange behavior
to the feature, which is hard to explain. Like what is documented in code
comment, this feature is not very useful with snapshot anyway. The decision
is to deprecate the feature.

We keep the function to avoid to break users code. However, we will fail
compactions if false is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954

Differential Revision: D13981900

Pulled By: siying

fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
2019-02-07 16:57:33 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Zhongyi Xie
71cae59a99 exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950)
Summary:
This will fix the following build error:

> db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’:
> db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950

Differential Revision: D13965378

Pulled By: miasantreble

fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56
2019-02-05 17:01:11 -08:00
Zhongyi Xie
00ed41daee Allow copy for PerfContext objects (#4919)
Summary:
Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919

Differential Revision: D13960406

Pulled By: miasantreble

fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334
2019-02-05 14:29:08 -08:00
Jay Zhuang
c9a52cbdc8 Fix potential DB hang while using CompactFiles (#4940)
Summary:
CompactFiles() may block auto compaction which could cuase DB hang when it
reachs level0_stop_writes_trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940

Differential Revision: D13929648

Pulled By: cooldoger

fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381
2019-02-05 11:23:38 -08:00
Siying Dong
8fe073324f BYTES_READ stats miscount for NotFound cases (#4938)
Summary:
In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes is still be increased. The amount increased will be
whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not
increasing these two counters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938

Differential Revision: D13908963

Pulled By: siying

fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162
2019-02-05 10:53:35 -08:00
yangzhijia
31221bb7e8 Properly set upper bound of subcompaction output (#4879) (#4898)
Summary:
Fix the ouput overlap bug when using subcompactions, the upper bound of output
file was extended incorrectly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898

Differential Revision: D13736107

Pulled By: ajkr

fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3
2019-02-05 10:20:16 -08:00
Maysam Yabandeh
30468d8eb4 Fix analyze error on possible un-initialized value (#4937)
Summary:
The patch fixes the following analyze error by checking the return status of ParseInternalKey.
```
db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value
    assert(kTypeMerge == orig_ikey.type);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937

Differential Revision: D13908506

Pulled By: maysamyabandeh

fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6
2019-02-01 09:41:27 -08:00
Ming Zhao
59244447e3 Zero seqnum of final key / drop final tombstone when compacting to bottommost level
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927

Differential Revision: D13889458

Pulled By: mzhaom

fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e
2019-02-01 09:21:57 -08:00
Yanqin Jin
842cdc11dd Use correct FileMeta for atomic flush result install (#4932)
Summary:
1. this commit fixes our handling of a combination of two separate edge
cases. If a flush job does not pick any memtable to flush (because another
flush job has already picked the same memtables), and the column family
assigned to the flush job is dropped right before RocksDB calls
rocksdb::InstallMemtableAtomicFlushResults, our original code passes
a FileMetaData object whose file number is 0, failing the assertion in
rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)).
2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932

Differential Revision: D13901322

Pulled By: riversand963

fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9
2019-01-31 14:49:51 -08:00
Maysam Yabandeh
35e5689e11 Take snapshots once for all cf flushes (#4934)
Summary:
FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934

Differential Revision: D13900747

Pulled By: maysamyabandeh

fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c
2019-01-31 12:21:59 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
158da7a6ee Verify checksum before ingestion (#4916)
Summary:
before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916

Differential Revision: D13863501

Pulled By: riversand963

fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
2019-01-29 17:17:29 -08:00
Sagar Vemuri
4978caaa6f Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925)
Summary:
While stepping through the code I noticed that there is a redundant call to TableFileName.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925

Differential Revision: D13845749

Pulled By: sagar0

fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d
2019-01-28 13:33:23 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Siying Dong
bc7d1661a8 Fix test name typo in PlainTableDBTest
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926

Differential Revision: D13830196

Pulled By: siying

fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021
2019-01-25 18:14:26 -08:00
Siying Dong
f184bee77b PlainTable should avoid copying Get() results from immortal source. (#4924)
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable
(read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924

Differential Revision: D13827749

Pulled By: siying

fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
2019-01-25 17:12:19 -08:00
Siying Dong
fc53839bfa Disallow customized hash function in DynamicBloom (#4915)
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915

Differential Revision: D13794452

Pulled By: siying

fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
2019-01-24 10:34:30 -08:00
Dmitry Fink
e07aa8669d Allow full merge when root of history for a key is reached (#4909)
Summary:
Previously compaction was not collapsing operands for a first
key on a layer, even in cases when it was its root of history. Some
tests (CompactionJobTest.NonAssocMerge) was actually accounting
for that bug,
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909

Differential Revision: D13781169

Pulled By: finik

fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336
2019-01-23 21:46:10 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
PeifengSi
43defe9872 Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902)
Summary:
Even one key falls in a file's range, we can not infer it definitely exists in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902

Differential Revision: D13795018

Pulled By: siying

fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914
2019-01-23 18:00:56 -08:00
Siying Dong
d94aa2f7db Make compaction_pri = kMinOverlappingRatio to be default (#4911)
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911

Differential Revision: D13789262

Pulled By: siying

fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
2019-01-23 16:47:38 -08:00
Siying Dong
5bf941966b CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907)
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907

Differential Revision: D13788774

Pulled By: siying

fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
2019-01-23 13:21:01 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Yi Wu
b1ad6ebba8 WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:

Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.

Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890

Differential Revision: D13705538

Pulled By: maysamyabandeh

fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:24:06 -08:00
Yi Wu
128f532858 WritePrepared: fix issue with snapshot released during compaction (#4858)
Summary:
Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858

Differential Revision: D13617146

Pulled By: maysamyabandeh

fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
2019-01-16 09:55:32 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Yi Wu
5d4fddfa52 WritePrepared: Fix visible key compacted out by compaction (#4883)
Summary:
With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case:
```
seq = 1: put "foo" = "bar"
seq = 2: transaction T: delete "foo", prepare
seq = 3: compaction start
seq = 4: take snapshot S
seq = 5: transaction T: commit.
...
seq = N: compaction iterator reached key "foo".
```
When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out.

The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883

Differential Revision: D13668775

Pulled By: maysamyabandeh

fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
2019-01-15 21:34:38 -08:00
Maysam Yabandeh
cad99a6031 WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886)
Summary:
The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare.
ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max.
To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886

Differential Revision: D13685270

Pulled By: maysamyabandeh

fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
2019-01-15 18:11:52 -08:00
Siying Dong
7d13f307ff Improve Error Message When wal_dir doesn't exist (#4874)
Summary:
Right now the error mesage when options.wal_dir doesn't exist is not helpful to users. Be more specific
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874

Differential Revision: D13642425

Pulled By: siying

fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5
2019-01-15 16:46:04 -08:00
Yanqin Jin
301da345ae Make a copy of MutableCFOptions to avoid race condition (#4876)
Summary:
If we do not do this, then reading MutableCFOptions may have a race condition
with SetOptions which modifies MutableCFOptions.

Also reserve space in advance for vectors to avoid reallocation changing the
address of its elements.

Test plan
```
$make clean && make -j32 all check
$make clean && COMPILE_WITH_TSAN=1 make -j32 all check
$make clean && COMPILE_WITH_ASAN=1 make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876

Differential Revision: D13644500

Pulled By: riversand963

fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47
2019-01-11 17:43:37 -08:00
Maysam Yabandeh
d56ac22b44 Remove duplicates from SnapshotList::GetAll (#4860)
Summary:
The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860

Differential Revision: D13615502

Pulled By: maysamyabandeh

fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b
2019-01-09 16:25:42 -08:00
Siying Dong
8641e9adf7 Non-initial file preloading should always prefetch index and filter (#4852)
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852

Differential Revision: D13595402

Pulled By: siying

fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
2019-01-08 12:47:34 -08:00
tom wang
42135523a0 modify comments about flush_queue_
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4850

Differential Revision: D13591940

Pulled By: sagar0

fbshipit-source-id: 617794e0a41d0f4554d40871180b061e84189fc5
2019-01-07 13:52:59 -08:00
Yi Wu
cf852fdf55 Minor fix: single delete a blob value is not a mismatch (#4848)
Summary:
In compaction iterator, if the next value of single delete is a blob value, it should not treated as mismatch. This is only a minor fix and doesn't affect correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848

Differential Revision: D13585812

Pulled By: yiwu-arbug

fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0
2019-01-04 16:31:02 -08:00
Andrew Kryczka
9e2c804fe6 Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:

- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.

The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829

Differential Revision: D13561355

Pulled By: ajkr

fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
2019-01-04 11:24:08 -08:00
Yanqin Jin
a07175af65 Refactor atomic flush result installation to MANIFEST (#4791)
Summary:
as titled.
Since different bg flush threads can flush different sets of column families
(due to column family creation and drop), we decide not to let one thread
perform atomic flush result installation for other threads. Bg flush threads
will install their atomic flush results sequentially to MANIFEST, using
a conditional variable, i.e. atomic_flush_install_cv_ to coordinate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791

Differential Revision: D13498930

Pulled By: riversand963

fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
2019-01-03 20:56:24 -08:00
Yi Wu
77a8d4d476 Detect if Jemalloc is linked with the binary (#4844)
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844

Differential Revision: D13574934

Pulled By: yiwu-arbug

fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
2019-01-03 16:30:12 -08:00
DorianZheng
8c79f79208 Fix skip WAL for whole write_group when leader's callback fail (#4838)
Summary:
The original implementation has two problems:

1. f0dda35d7d/db/db_impl_write.cc (L478)
f0dda35d7d/db/write_thread.h (L231)

If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss.

2. f0dda35d7d/db/write_thread.h (L130)
The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838

Differential Revision: D13574070

Pulled By: yiwu-arbug

fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b
2019-01-03 12:40:42 -08:00
Siying Dong
e4feb78606 Try to fix DBSSTTest.RateLimitedDelete flakiness (#4840)
Summary:
DBSSTTest.RateLimitedDelete is flakey. The root cause is not completely identified, but
the compaction waiting in the test doesn't strictly wait for compaction cleaning to finish, which
may cause test flakiness. Fix it first and see whether the failures still happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840

Differential Revision: D13567273

Pulled By: siying

fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a
2019-01-03 11:05:19 -08:00
Andrew Kryczka
ace543a815 fix accounting for range tombstones in TableProperties (#4841)
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841

Differential Revision: D13568424

Pulled By: ajkr

fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
2019-01-02 15:08:53 -08:00
Anand Ananthabhotla
b9d6eccac1 Lock free MultiGet (#4754)
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.

Tests:
1. Unit tests
2. make check
3. db_bench -

TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)

Multireadrandom with batch size 1:
multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754

Differential Revision: D13363550

Pulled By: anand1976

fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
2019-01-02 11:42:54 -08:00