Commit Graph

3586 Commits

Author SHA1 Message Date
Yanqin Jin
392f6d49e5 Fix a bug in GetOverlappingInputsRangeBinarySearch (#5211)
Summary:
As title.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5211

Differential Revision: D14992018

Pulled By: riversand963

fbshipit-source-id: b5720ea4742029e2fb47ff6d9f8d9de006db4ed4
2019-04-18 09:22:16 -07:00
JiYou
5b7e09bd6f VersionSet: optmize GetOverlappingInputsRangeBinarySearch (#4987)
Summary:
`GetOverlappingInputsRangeBinarySearch` firstly use binary search
to find a index in the given range `[begin, end]`. But after find
the index, then use linear search to find the `start_index` and
`end_index`. So the search process degraded to linear time.

Here optmize the search process with below changes:

- use `std::lower_bound` and `std::upper_bound` to get
  `lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
  and `ExtendFileRangeOverlappingInterval`.

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987

Differential Revision: D14984192

Pulled By: riversand963

fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
2019-04-17 18:15:20 -07:00
Zhongyi Xie
248b6b551e rename variable to avoid shadowing (#5204)
Summary:
this PR fixes the following compile warning:
```
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204

Differential Revision: D14970160

Pulled By: miasantreble

fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
2019-04-17 10:15:05 -07:00
Zhongyi Xie
baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Siying Dong
beb44ec3eb WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175)
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget.
Change it to 256KB accordingly.
There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175

Differential Revision: D14954289

Pulled By: siying

fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
2019-04-16 12:03:07 -07:00
yiwu-arbug
f1239d5f10 Avoid per-key upper bound check in BlockBasedTableIterator (#5142)
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.

This patch come with two fixes:

Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.

Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142

Differential Revision: D14907113

Pulled By: siying

fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
2019-04-16 11:37:47 -07:00
Vijay Nadimpalli
71a82a0abe Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
Summary:
Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188

Differential Revision: D14942832

Pulled By: vjnadimpalli

fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
2019-04-15 18:51:04 -07:00
Yi Zhang
3e63e553b4 Fix MultiGet ASSERT bug when passing unsorted result (#5195)
Summary:
Found this when test driving the new MultiGet. If you pass unsorted result with sorted_result = false you'll trigger the ASSERT incorrect even though we'll sort down below.

I've also added simple test cover sorted_result=true/false scenario copied from MultiGetSimple.

anand1976
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195

Differential Revision: D14935475

Pulled By: yizhang82

fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
2019-04-15 11:35:21 -07:00
anand76
29111e92b4 Add bounds check in FilePickerMultiGet::PrepareNextLevel() (#5189)
Summary:
Add bounds check when looping through empty levels in FilePickerMultiGet
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5189

Differential Revision: D14925334

Pulled By: anand1976

fbshipit-source-id: 65d53247cf443153e28ce2b8b753fa51c6ae4566
2019-04-12 18:05:09 -07:00
yiwu-arbug
cca141ecf8 Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190)
Summary:
Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190

Differential Revision: D14923773

Pulled By: miasantreble

fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
2019-04-12 17:07:49 -07:00
Maysam Yabandeh
fe642cbee6 WritePrepared: fix race condition in reading batch with duplicate keys (#5147)
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit  test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147

Differential Revision: D14758815

Pulled By: maysamyabandeh

fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
2019-04-12 14:40:41 -07:00
Siying Dong
85b2bde3dd Still implement StatisticsImpl::measureTime() (#5181)
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.

Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181

Differential Revision: D14907089

Pulled By: siying

fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
2019-04-12 11:00:35 -07:00
Yanqin Jin
3189398c00 Fix bugs detected by clang analyzer (#5185)
Summary:
as titled. False positive included, fixed anyway to make the check
pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185

Differential Revision: D14909384

Pulled By: riversand963

fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
2019-04-12 10:45:56 -07:00
vijaynadimpalli
f49e12b892 Added missing table properties in log (#5168)
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168

Differential Revision: D14876928

Pulled By: vjnadimpalli

fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
2019-04-11 14:33:49 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong
ed9f5e21aa Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165)
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165

Differential Revision: D14880709

Pulled By: siying

fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
2019-04-11 10:45:36 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Siying Dong
0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.

Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00
Yanqin Jin
de00f28132 Refactor ExternalSSTFileTest (#5129)
Summary:
remove an unnecessary function `GenerateAndAddFileIngestBehind`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5129

Differential Revision: D14686710

Pulled By: riversand963

fbshipit-source-id: 5698ae63e10f8ef76c2da753bbb07a36024ac065
2019-04-08 11:16:34 -07:00
Sergei Glushchenko
39c6c5fc1b Expose DB methods to lock and unlock the WAL (#5146)
Summary:
Expose DB methods to lock and unlock the WAL.

These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.

Usage scenario is following:

MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:

1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
   InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
   LSN and checkpoint LSN. MyRocks should report active WAL files names
   and sizes.
5. Release storage engine's locks
6. Unlock binary log

Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.

Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.

LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146

Differential Revision: D14815447

Pulled By: maysamyabandeh

fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
2019-04-06 06:40:36 -07:00
Adam Simpkins
c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
Maysam Yabandeh
75e8b6dfcf Fix race condition in IteratorWithLocalStatistics (#5149)
Summary:
The ReadCallback was shared between all threads in IteratorWithLocalStatistics. A race condition was
 hence introduced with recent changes that changes the content of ReadCallback. The patch fixes that by using a separate callback per thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5149

Differential Revision: D14761612

Pulled By: maysamyabandeh

fbshipit-source-id: 814a316aed046c318cb90e22379a6e32ac528949
2019-04-03 16:04:38 -07:00
Zhichao Cao
ebb9b2ed16 Fix the potential DB crash caused by call EndTrace before StartTrace (#5130)
Summary:
Although user should first call StartTrace to begin the RocksDB tracing function and call EndTrace to stop the tracing process, user can accidentally call EndTrace first. It will cause segment fault and crash the DB instance. The issue is fixed by checking the pointer first.

Test case added in db_test2.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130

Differential Revision: D14691420

Pulled By: zhichao-cao

fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8
2019-04-03 13:26:34 -07:00
Zhongyi Xie
e8480d4d9d add assert to silence clang analyzer and fix variable shadowing (#5140)
Summary:
This PR address two open issues:

1.  clang analyzer is paranoid about db_ being nullptr after DB::Open calls in the test.
See https://github.com/facebook/rocksdb/pull/5043#discussion_r271394579
Add an assert to keep clang happy
2. PR https://github.com/facebook/rocksdb/pull/5049 introduced a  variable shadowing:
```
db/db_iterator_test.cc: In constructor ‘rocksdb::DBIteratorWithReadCallbackTest_ReadCallback_Test::TestBody()::TestReadCallback::TestReadCallback(rocksdb::SequenceNumber)’:
db/db_iterator_test.cc:2484:9: error: declaration of ‘max_visible_seq’ shadows a member of 'this' [-Werror=shadow]
         : ReadCallback(max_visible_seq) {}
         ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5140

Differential Revision: D14735497

Pulled By: miasantreble

fbshipit-source-id: 3219ea75cf4ae04f64d889323f6779e84be98144
2019-04-02 21:15:44 -07:00
Maysam Yabandeh
5234fc1b70 Mark logs with prepare in PreReleaseCallback (#5121)
Summary:
In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121

Differential Revision: D14665210

Pulled By: maysamyabandeh

fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
2019-04-02 15:17:47 -07:00
Maysam Yabandeh
14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows +0.26% higher throughput in seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Siying Dong
ebcc8ae1d3 Revert "Avoid per-key upper bound check in BlockBasedTableIterator (#5101)" (#5132)
Summary:
This reverts commit f29dc1b906.

In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101.
Temporarily revert the diff to keep the branch clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132

Differential Revision: D14718584

Pulled By: siying

fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad
2019-04-02 10:00:38 -07:00
Mike Kolupaev
120bc4715b Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043)
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.

In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.

(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043

Differential Revision: D14339233

Pulled By: al13n321

fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
2019-04-01 17:10:40 -07:00
Yi Wu
f29dc1b906 Avoid per-key upper bound check in BlockBasedTableIterator (#5101)
Summary:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101

Differential Revision: D14678707

Pulled By: siying

fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e
2019-03-29 13:11:46 -07:00
Yanqin Jin
09957ded1d Update RepeatableThreadTest with MockTimeEnv (#5107)
Summary:
**This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X.**
`RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies test and production code path for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than (real) current time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing test to hang. The extra 10 seconds is a best-effort approach because there seems no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye for possible flakiness of this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107

Differential Revision: D14680885

Pulled By: riversand963

fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e
2019-03-29 10:08:50 -07:00
anand76
dae3b5545c Smooth the deletion of WAL files (#5116)
Summary:
WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116

Differential Revision: D14669437

Pulled By: anand1976

fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
2019-03-28 15:17:13 -07:00
Siying Dong
106a94af15 Improve obsolete_files_test (#5125)
Summary:
We see a failure of obsolete_files_test but aren't able to identify
the issue. Improve the test in following way and hope we can debug
better next time:
1. Place sync point before automatic compaction runs so race condition
   will always trigger.
2. Disable sync point before test finishes.
3. ASSERT_OK() instead of ASSERT_TRUE(status.ok())
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5125

Differential Revision: D14669456

Pulled By: siying

fbshipit-source-id: dccb7648e334501ad651eb212880096eef1f4ab2
2019-03-28 13:16:02 -07:00
Siying Dong
89ab1381f8 Apply automatic formatting to some files (#5114)
Summary:
Following files were run through automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h

I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114

Differential Revision: D14633040

Pulled By: siying

fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
2019-03-27 16:24:45 -07:00
Siying Dong
5f6adf3f6a Fix some variable naming in db/transaction_log_impl.* (#5112)
Summary:
We follow Google C++ Style which indicates variable names should be
all underscore: https://google.github.io/styleguide/cppguide.html#Variable_Names
Fix some variable names under db/transaction_log_impl.*
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5112

Differential Revision: D14631157

Pulled By: siying

fbshipit-source-id: 9525c9b0976b843bca377b03897700d87cc60af8
2019-03-27 12:27:54 -07:00
Yi Wu
d69241586e Fix perf_context.user_key_comparison_count for range scan (#5098)
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.

Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098

Differential Revision: D14603753

Pulled By: siying

fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
2019-03-27 10:34:27 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Shi Feng
01e6badbb6 Introduce CPU timers for iterator seek and next (#5076)
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076

Differential Revision: D14525218

Pulled By: fredfsh

fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
2019-03-26 16:32:13 -07:00
Siying Dong
48e7effa79 Avoid to go through every CF for every ReleaseSnapshot() (#5090)
Summary:
With https://github.com/facebook/rocksdb/pull/3009 we go through every CF
to check whether a bottommost compaction is needed to be triggered. This is done
within DB mutex. What we do within DB mutex may heavily influece the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first check a global threshold. In most of
the time, the CF loop can be avoided.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5090

Differential Revision: D14582684

Pulled By: siying

fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
2019-03-25 19:18:04 -07:00
Maysam Yabandeh
c84fad7a19 Reorder DBIter fields to reduce memory usage (#5078)
Summary:
The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using less 64-bit allocations for bools and enums.

This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive.

./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078

Differential Revision: D14562676

Pulled By: maysamyabandeh

fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167
2019-03-21 09:55:09 -07:00
Zhongyi Xie
a291f3a1e5 Collect compaction stats by priority and dump to info LOG (#5050)
Summary:
In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump.

```
** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050

Differential Revision: D14408583

Pulled By: miasantreble

fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
2019-03-19 17:28:19 -07:00
Wenjie Yang
36c2a7cfb1 Add an option to filter traces (#5082)
Summary:
Add an option to filter out READ or WRITE operations while tracing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082

Differential Revision: D14515083

Pulled By: mrmiywj

fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610
2019-03-19 14:36:51 -07:00
Hiroaki Nakamura
f2f6acbef3 Add missing C API for transaction (#5077)
Summary:
Partly addresses https://github.com/facebook/rocksdb/issues/4999
I verified `make static_lib` runs fine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5077

Differential Revision: D14521101

Pulled By: maysamyabandeh

fbshipit-source-id: ba88e74a51d2d793cac7260d505b1a54254b53af
2019-03-19 09:43:22 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
anand76
b4fa51dfaf Update bg_error when log flush fails in SwitchMemtable() (#5072)
Summary:
There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order set the background error since the WriteBuffer() failure may result in data loss.

Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072

Differential Revision: D14461384

Pulled By: anand1976

fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
2019-03-15 15:19:25 -07:00
Siying Dong
0920bf4e68 Revert "Remove PlainTable's feature store_index_in_file (#4914)" (#5034)
Summary:
This reverts commit ee1818081f.

We are not ready to deprecate this feature. revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5034

Differential Revision: D14287246

Pulled By: siying

fbshipit-source-id: e4beafdeaee1c94364fdaa6ba198218d158339f7
2019-03-01 15:45:45 -08:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Maysam Yabandeh
77ebc82b92 Call PreReleaseCallback between WAL and memtable write (#5015)
Summary:
PreReleaseCallback meant to be called before the writes are visible to the readers. Since the sequence number is known after the WAL write, there is no reason to delay calling PreReleaseCallback to after the memtable write, which would complicates the reader's logic in presence of our memtable writes that are made visible by the other write thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5015

Differential Revision: D14221670

Pulled By: maysamyabandeh

fbshipit-source-id: a504dd665cf923226d7af09cc8e9c7739a25edc6
2019-02-28 15:49:11 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Maysam Yabandeh
a661c0d208 WritePrepared: optimize read path by avoiding virtual (#5018)
Summary:
The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018

Differential Revision: D14226562

Pulled By: maysamyabandeh

fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
2019-02-26 16:56:19 -08:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Siying Dong
93f7e7a450 Temporarily Disable DBTest2.PresetCompressionDict (#5003)
Summary:
DBTest2.PresetCompressionDict is flaky. Temparily disable it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003

Differential Revision: D14139505

Pulled By: siying

fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a
2019-02-19 14:44:12 -08:00
Michael Liu
3c5d1b16b1 Apply modernize-use-override (3)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

bypass-lint
drop-conflicts

Reviewed By: igorsugak

Differential Revision: D14131816

fbshipit-source-id: f20e7f7cecf2e699d70f5fa036f72c0e3f59b50e
2019-02-19 13:39:49 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Siying Dong
c2affccc18 Header logger should call LogHeader() (#4980)
Summary:
The info log header feature never worked well, because log level Header was not
translated to Logger::LogHeader() call. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4980

Differential Revision: D14087283

Pulled By: siying

fbshipit-source-id: 7e7d03ce35fa8d13d4ee549f46f7326f7bc0006d
2019-02-15 16:59:36 -08:00
Siying Dong
26a33ee5bd flush_job logs data size too (#4979)
Summary:
Right now when a flush is triggered, the memory consumption is logged but data size is not.
It's useful to log both when we debug unexpected small flushed file size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4979

Differential Revision: D14071979

Pulled By: siying

fbshipit-source-id: 0cd60449c5205eb00e0fbc299084418f609904ed
2019-02-15 16:33:19 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Andrew Kryczka
c8c8104d7e Dictionary compression for files written by SstFileWriter (#4978)
Summary:
If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.

I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978

Differential Revision: D14060763

Pulled By: ajkr

fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
2019-02-14 11:23:55 -08:00
Yanqin Jin
4fc442029a Avoid using kInAtomicGroup tag for single-cf op (#4981)
Summary:
if an operation just involves a single column family, then we do
not have to set the kInAtomicGroup tag when writing to MANIFEST. This change
can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize
kInAtomicGroup tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981

Differential Revision: D14072687

Pulled By: riversand963

fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4
2019-02-13 18:33:42 -08:00
Yanqin Jin
a69d4deefb Atomic ingest (#4895)
Summary:
Make file ingestion atomic.

 as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or non of the ingestions have any effect
on the state of the db.

Also add unit tests for atomic ingestion.

Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895

Differential Revision: D13718245

Pulled By: riversand963

fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
2019-02-12 19:16:17 -08:00
Siying Dong
49ddd7ec4f Stats should be logged in INFO level (#4977)
Summary:
Previously, stats were logged in warning level. This was done in that way because
people reported that it wasn't logged in MyRocks. However, later we learned that it turns
out to be due to a bug in MyRocks, which is fixed in
79bb705e74

Now we revert the stats logging to INFO level, so that it doesn't pollute the warning
level logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977

Differential Revision: D14058485

Pulled By: siying

fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6
2019-02-12 16:54:55 -08:00
Yanqin Jin
c5a64cffd2 Avoid fsync on the same directory in atomic flush (#4817)
Summary:
In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once
on the same data directory. If two column families share a common directory for
their data, we call fsync only once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817

Differential Revision: D13543689

Pulled By: riversand963

fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5
2019-02-12 12:28:36 -08:00
Andrew Kryczka
62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
Maysam Yabandeh
576d2d6c60 WritePrepared: relax assert in compaction iterator (#4969)
Summary:
If IsInSnapshot(seq2, snapshot) determines that the snapshot is released, the future queries IsInSnapshot(seq1, snapshot) could still return a definitive answer of true if for example seq1 is too old that is determined visible in all snapshots. This violates a recently added assert statement to compaction iterator. The patch relaxes the assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4969

Differential Revision: D14030998

Pulled By: maysamyabandeh

fbshipit-source-id: 6db53db0e37d0a20e8997ef2c1004b8627614ab9
2019-02-11 15:01:46 -08:00
Yanqin Jin
2d049ab7e8 Checksum properties block for block-based table (#4956)
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956

Differential Revision: D14012741

Pulled By: riversand963

fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
2019-02-11 11:50:01 -08:00
Siying Dong
5d9a623e2c Add a unit test to Ignorable manfiest record (#4964)
Summary:
https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manfiest
record. Adding a test to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964

Differential Revision: D14012667

Pulled By: siying

fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef
2019-02-11 11:20:24 -08:00
tang-jianfeng
08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Siying Dong
1a761e6a6c Add a placeholder in manifest indicating ignorable record (#4960)
Summary:
We want to reserve some right that some extra information added manifest
in the future can be forward compatible by previous versions. Now we create a
place holder for that. A bit in tag is added to indicate that a field can be
safely ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960

Differential Revision: D14000484

Pulled By: siying

fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6
2019-02-08 11:33:11 -08:00
Siying Dong
f48758e939 Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954)
Summary:
We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
what we have expected. We thought that snapshot will always be preserved.
However, we just realized that, if no snapshot is created while compaction
starts, and a snapshot is created after that, the data seen from the snapshot
can successfully be dropped by the compaction. This creates a strange behavior
to the feature, which is hard to explain. Like what is documented in code
comment, this feature is not very useful with snapshot anyway. The decision
is to deprecate the feature.

We keep the function to avoid to break users code. However, we will fail
compactions if false is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954

Differential Revision: D13981900

Pulled By: siying

fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
2019-02-07 16:57:33 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Zhongyi Xie
71cae59a99 exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950)
Summary:
This will fix the following build error:

> db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’:
> db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950

Differential Revision: D13965378

Pulled By: miasantreble

fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56
2019-02-05 17:01:11 -08:00
Zhongyi Xie
00ed41daee Allow copy for PerfContext objects (#4919)
Summary:
Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919

Differential Revision: D13960406

Pulled By: miasantreble

fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334
2019-02-05 14:29:08 -08:00
Jay Zhuang
c9a52cbdc8 Fix potential DB hang while using CompactFiles (#4940)
Summary:
CompactFiles() may block auto compaction which could cuase DB hang when it
reachs level0_stop_writes_trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940

Differential Revision: D13929648

Pulled By: cooldoger

fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381
2019-02-05 11:23:38 -08:00
Siying Dong
8fe073324f BYTES_READ stats miscount for NotFound cases (#4938)
Summary:
In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes is still be increased. The amount increased will be
whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not
increasing these two counters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938

Differential Revision: D13908963

Pulled By: siying

fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162
2019-02-05 10:53:35 -08:00
yangzhijia
31221bb7e8 Properly set upper bound of subcompaction output (#4879) (#4898)
Summary:
Fix the ouput overlap bug when using subcompactions, the upper bound of output
file was extended incorrectly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898

Differential Revision: D13736107

Pulled By: ajkr

fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3
2019-02-05 10:20:16 -08:00
Maysam Yabandeh
30468d8eb4 Fix analyze error on possible un-initialized value (#4937)
Summary:
The patch fixes the following analyze error by checking the return status of ParseInternalKey.
```
db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value
    assert(kTypeMerge == orig_ikey.type);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937

Differential Revision: D13908506

Pulled By: maysamyabandeh

fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6
2019-02-01 09:41:27 -08:00
Ming Zhao
59244447e3 Zero seqnum of final key / drop final tombstone when compacting to bottommost level
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927

Differential Revision: D13889458

Pulled By: mzhaom

fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e
2019-02-01 09:21:57 -08:00
Yanqin Jin
842cdc11dd Use correct FileMeta for atomic flush result install (#4932)
Summary:
1. this commit fixes our handling of a combination of two separate edge
cases. If a flush job does not pick any memtable to flush (because another
flush job has already picked the same memtables), and the column family
assigned to the flush job is dropped right before RocksDB calls
rocksdb::InstallMemtableAtomicFlushResults, our original code passes
a FileMetaData object whose file number is 0, failing the assertion in
rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)).
2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932

Differential Revision: D13901322

Pulled By: riversand963

fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9
2019-01-31 14:49:51 -08:00
Maysam Yabandeh
35e5689e11 Take snapshots once for all cf flushes (#4934)
Summary:
FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934

Differential Revision: D13900747

Pulled By: maysamyabandeh

fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c
2019-01-31 12:21:59 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
158da7a6ee Verify checksum before ingestion (#4916)
Summary:
before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916

Differential Revision: D13863501

Pulled By: riversand963

fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
2019-01-29 17:17:29 -08:00
Sagar Vemuri
4978caaa6f Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925)
Summary:
While stepping through the code I noticed that there is a redundant call to TableFileName.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925

Differential Revision: D13845749

Pulled By: sagar0

fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d
2019-01-28 13:33:23 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Siying Dong
bc7d1661a8 Fix test name typo in PlainTableDBTest
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926

Differential Revision: D13830196

Pulled By: siying

fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021
2019-01-25 18:14:26 -08:00
Siying Dong
f184bee77b PlainTable should avoid copying Get() results from immortal source. (#4924)
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable
(read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924

Differential Revision: D13827749

Pulled By: siying

fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
2019-01-25 17:12:19 -08:00
Siying Dong
fc53839bfa Disallow customized hash function in DynamicBloom (#4915)
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915

Differential Revision: D13794452

Pulled By: siying

fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
2019-01-24 10:34:30 -08:00
Dmitry Fink
e07aa8669d Allow full merge when root of history for a key is reached (#4909)
Summary:
Previously compaction was not collapsing operands for a first
key on a layer, even in cases when it was its root of history. Some
tests (CompactionJobTest.NonAssocMerge) was actually accounting
for that bug,
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909

Differential Revision: D13781169

Pulled By: finik

fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336
2019-01-23 21:46:10 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
PeifengSi
43defe9872 Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902)
Summary:
Even one key falls in a file's range, we can not infer it definitely exists in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902

Differential Revision: D13795018

Pulled By: siying

fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914
2019-01-23 18:00:56 -08:00
Siying Dong
d94aa2f7db Make compaction_pri = kMinOverlappingRatio to be default (#4911)
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911

Differential Revision: D13789262

Pulled By: siying

fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
2019-01-23 16:47:38 -08:00
Siying Dong
5bf941966b CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907)
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907

Differential Revision: D13788774

Pulled By: siying

fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
2019-01-23 13:21:01 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Yi Wu
b1ad6ebba8 WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:

Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.

Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890

Differential Revision: D13705538

Pulled By: maysamyabandeh

fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:24:06 -08:00
Yi Wu
128f532858 WritePrepared: fix issue with snapshot released during compaction (#4858)
Summary:
Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858

Differential Revision: D13617146

Pulled By: maysamyabandeh

fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
2019-01-16 09:55:32 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Yi Wu
5d4fddfa52 WritePrepared: Fix visible key compacted out by compaction (#4883)
Summary:
With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case:
```
seq = 1: put "foo" = "bar"
seq = 2: transaction T: delete "foo", prepare
seq = 3: compaction start
seq = 4: take snapshot S
seq = 5: transaction T: commit.
...
seq = N: compaction iterator reached key "foo".
```
When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out.

The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883

Differential Revision: D13668775

Pulled By: maysamyabandeh

fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
2019-01-15 21:34:38 -08:00
Maysam Yabandeh
cad99a6031 WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886)
Summary:
The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare.
ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max.
To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886

Differential Revision: D13685270

Pulled By: maysamyabandeh

fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
2019-01-15 18:11:52 -08:00
Siying Dong
7d13f307ff Improve Error Message When wal_dir doesn't exist (#4874)
Summary:
Right now the error mesage when options.wal_dir doesn't exist is not helpful to users. Be more specific
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874

Differential Revision: D13642425

Pulled By: siying

fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5
2019-01-15 16:46:04 -08:00
Yanqin Jin
301da345ae Make a copy of MutableCFOptions to avoid race condition (#4876)
Summary:
If we do not do this, then reading MutableCFOptions may have a race condition
with SetOptions which modifies MutableCFOptions.

Also reserve space in advance for vectors to avoid reallocation changing the
address of its elements.

Test plan
```
$make clean && make -j32 all check
$make clean && COMPILE_WITH_TSAN=1 make -j32 all check
$make clean && COMPILE_WITH_ASAN=1 make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876

Differential Revision: D13644500

Pulled By: riversand963

fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47
2019-01-11 17:43:37 -08:00