Commit Graph

4099 Commits

Author SHA1 Message Date
Zhongyi Xie
aa56b7e74a secondary instance: add support for WAL tailing on OpenAsSecondary
Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds support for WAL tailing in `OpenAsSecondary`, which means that after the `OpenAsSecondary` call, the secondary can see the primary's writes that are yet to be flushed. The secondary can see the primary's writes in the WAL up to the moment the `OpenAsSecondary` call starts.

Differential Revision: D15059905

Pulled By: miasantreble

fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10
2019-04-24 12:08:44 -07:00
qinzuoyan
a7d103198e Print smallest and largest seqno in Version::DebugString() for more details (#5231)
Summary:
In some cases, we want to know the smallest and largest sequence numbers of sstable files, to help us get more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5231

Differential Revision: D15038087

Pulled By: siying

fbshipit-source-id: c473c1ca07b53efe2f1884fa1ecdc8686f455ed8
2019-04-23 11:22:02 -07:00
Siying Dong
72c8533f2c DBIter to use IteratorWrapper for inner iterator (#5214)
Summary:
It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). The performance gain is hard to measure because it is small and there is variation. I ran readseq many times, and there seems to be a 1% gain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214

Differential Revision: D15003635

Pulled By: siying

fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
2019-04-23 10:55:01 -07:00
Sagar Vemuri
47fd574829 Log file_creation_time table property (#5232)
Summary:
Log file_creation_time table property when a new table file is created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5232

Differential Revision: D15033069

Pulled By: sagar0

fbshipit-source-id: aaac56a4c03a8f96c338cad1b0cdb7fbfb887647
2019-04-22 15:30:07 -07:00
Mike Kolupaev
df38c1ce66 Add BlockBasedTableOptions::index_shortening (#5174)
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control over which key shortening techniques are used in building index blocks. Before this patch, both separators and successor keys were shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have a negative impact on index block size. However, it should prevent many unnecessary block loads where, due to the approximation introduced by the shortened successor, a seek would land in the previous block and then fix it by moving to the next one.
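To make "shortening a separator" concrete, here is a minimal standalone sketch. It uses a simplified bytewise rule and plain `std::string` keys; the function name `ShortestSeparator` is hypothetical, and the comparator logic RocksDB actually uses may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns a key k with start <= k < limit that is as short as this simple
// rule can make it. Falls back to `start` when no shortening is possible.
std::string ShortestSeparator(const std::string& start,
                              const std::string& limit) {
  assert(start < limit);
  const std::size_t min_len = std::min(start.size(), limit.size());
  std::size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) ++diff;
  if (diff >= min_len) return start;  // one key is a prefix of the other
  const unsigned char s = static_cast<unsigned char>(start[diff]);
  const unsigned char l = static_cast<unsigned char>(limit[diff]);
  if (s < 0xff && s + 1 < l) {
    // Bump the first differing byte and drop everything after it.
    std::string out = start.substr(0, diff + 1);
    out[diff] = static_cast<char>(s + 1);
    return out;
  }
  return start;
}
```

For example, the separator between `abcdefg` and `abzzz` becomes `abd`, which is shorter than either key but still sorts between them.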
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174

Differential Revision: D14884185

Pulled By: al13n321

fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
2019-04-22 08:20:35 -07:00
jsteemann
de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points_` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints would very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
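
The three points above can be sketched in a few lines. This is an illustrative stand-in, not the RocksDB code: `Batch`, `SavePoint`, `Append` and `Contents` are hypothetical names, and the batch contents are modelled as a plain string.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the RocksDB types.
struct SavePoint {
  std::size_t size;  // length of the batch contents when the savepoint was set
};

struct SavePoints {
  std::vector<SavePoint> stack;  // used as a stack: push_back / back / pop_back
};

class Batch {
 public:
  Batch() = default;

  // Deep copy: the new batch owns its own SavePoints object, so destroying
  // one batch can never free memory the other still uses.
  Batch(const Batch& other)
      : rep_(other.rep_),
        save_points_(other.save_points_
                         ? new SavePoints(*other.save_points_)
                         : nullptr) {}

  Batch& operator=(const Batch& other) {
    if (this != &other) {
      rep_ = other.rep_;
      save_points_.reset(
          other.save_points_ ? new SavePoints(*other.save_points_) : nullptr);
    }
    return *this;
  }

  void Append(const std::string& data) { rep_ += data; }

  void SetSavePoint() {
    if (!save_points_) save_points_.reset(new SavePoints());  // created lazily
    save_points_->stack.push_back(SavePoint{rep_.size()});
  }

  // Returns false when there is no savepoint to roll back to.
  bool RollbackToSavePoint() {
    if (!save_points_ || save_points_->stack.empty()) return false;
    const SavePoint sp = save_points_->stack.back();
    save_points_->stack.pop_back();
    assert(sp.size <= rep_.size());
    rep_.resize(sp.size);
    return true;
  }

  const std::string& Contents() const { return rep_; }

 private:
  std::string rep_;
  std::unique_ptr<SavePoints> save_points_;
};
```

With a raw pointer and a shallow copy, both copies would delete the same `SavePoints` object. Here a rollback on the copy leaves the original untouched.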
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
Yanqin Jin
c77aab584e Force read existing data during db repair (#5209)
Summary:
Setting read_opts.total_order_seek achieves this, even with a different prefix
extractor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5209

Differential Revision: D14980388

Pulled By: riversand963

fbshipit-source-id: 16527989a3d6b3e3ae8241c894d011326429d66e
2019-04-19 11:55:13 -07:00
Siying Dong
7a73adda9c Add some "inline" annotation to DBIter functions (#5217)
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlining this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217

Differential Revision: D15004086

Pulled By: siying

fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
2019-04-19 10:38:43 -07:00
Sagar Vemuri
efa948741c Use creation_time or mtime when file_creation_time=0 (#5184)
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).

To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.

Here is how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
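
A minimal sketch of that fallback chain, assuming the three timestamps have already been read. The struct and function names are hypothetical, and `mtime_ok` stands for "the Env returned a usable modification time".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inputs: the two table properties plus the mtime reported by
// the Env. mtime_ok is false when the modification time could not be read.
struct FileTimes {
  std::uint64_t file_creation_time;
  std::uint64_t creation_time;
  std::uint64_t mtime;
  bool mtime_ok;
};

// Writes the chosen timestamp to *out and returns true, or returns false
// when none of the three sources is usable (the file is then skipped).
bool EffectiveFileTime(const FileTimes& f, std::uint64_t* out) {
  assert(out != nullptr);
  if (f.file_creation_time > 0) {
    *out = f.file_creation_time;
    return true;
  }
  if (f.creation_time > 0) {
    *out = f.creation_time;
    return true;
  }
  if (f.mtime_ok) {
    *out = f.mtime;
    return true;
  }
  return false;
}
```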
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184

Differential Revision: D14907795

Pulled By: sagar0

fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
2019-04-18 22:39:34 -07:00
Siying Dong
d6862b3f51 Make ReadRangeDelAggregator::ShouldDelete() more inline friendly (#5202)
Summary:
Reorganize the code so that no function call into ReadRangeDelAggregator is needed if there is no range tombstone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5202

Differential Revision: D14968155

Pulled By: siying

fbshipit-source-id: 0bd61911293c7a27b4e1b8d57c66d0c4ad6a6a5f
2019-04-18 12:27:25 -07:00
Siying Dong
01cfea6637 Some small code changes to improve Next() (#5200)
Summary:
Several small changes for Next():
1. Reducing branching by always updating local_stats_.next_count_++ even if statistics is null. This should be faster than a branch.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for the non-merge case. Remove this check.
4. Adding an inline annotation. It ends up not being picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200

Differential Revision: D15000391

Pulled By: siying

fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
2019-04-18 12:18:11 -07:00
Siying Dong
992dfc7811 Introduce InternalIteratorBase::NextAndGetResult() (#5197)
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and added some "final" annotations in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by -O2 too. It doesn't hurt to add those final annotations.
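
The idea can be illustrated with a small standalone sketch. The `Iter` interface and `VectorIter` below are hypothetical and use `std::string` keys; they only mirror the shape described above, not the real `InternalIteratorBase`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class Iter {
 public:
  virtual ~Iter() {}
  virtual bool Valid() const = 0;
  virtual void Next() = 0;
  virtual const std::string& key() const = 0;

  // Default: three virtual calls. Subclasses may override this to do the
  // same work inside a single virtual call.
  virtual bool NextAndGetResult(std::string* ret_key) {
    Next();
    const bool ok = Valid();
    if (ok) *ret_key = key();
    return ok;
  }
};

class VectorIter final : public Iter {
 public:
  explicit VectorIter(std::vector<std::string> keys)
      : keys_(std::move(keys)) {}
  bool Valid() const override { return pos_ < keys_.size(); }
  void Next() override {
    assert(Valid());
    ++pos_;
  }
  const std::string& key() const override {
    assert(Valid());
    return keys_[pos_];
  }
  bool NextAndGetResult(std::string* ret_key) override {
    assert(Valid());
    ++pos_;
    if (pos_ >= keys_.size()) return false;
    *ret_key = keys_[pos_];
    return true;
  }

 private:
  std::vector<std::string> keys_;
  std::size_t pos_ = 0;
};

// A scan through the base-class pointer: one virtual call per step after
// the first key.
std::vector<std::string> Scan(Iter* it) {
  std::vector<std::string> out;
  if (!it->Valid()) return out;
  out.push_back(it->key());
  std::string k;
  while (it->NextAndGetResult(&k)) out.push_back(k);
  return out;
}
```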
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197

Differential Revision: D14945977

Pulled By: siying

fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
2019-04-18 11:12:39 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Yanqin Jin
392f6d49e5 Fix a bug in GetOverlappingInputsRangeBinarySearch (#5211)
Summary:
As title.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5211

Differential Revision: D14992018

Pulled By: riversand963

fbshipit-source-id: b5720ea4742029e2fb47ff6d9f8d9de006db4ed4
2019-04-18 09:22:16 -07:00
JiYou
5b7e09bd6f VersionSet: optimize GetOverlappingInputsRangeBinarySearch (#4987)
Summary:
`GetOverlappingInputsRangeBinarySearch` first uses binary search
to find an index in the given range `[begin, end]`. But after finding
the index, it uses linear search to find the `start_index` and
`end_index`. So the search process degrades to linear time.

This change optimizes the search process as follows:

- use `std::lower_bound` and `std::upper_bound` to get
  `lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
  and `ExtendFileRangeOverlappingInterval`.
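
As an illustration of the first bullet, here is a self-contained sketch of finding the overlapping file range with two binary searches. Integer keys stand in for internal keys, and `FileRange` and `OverlappingFiles` are hypothetical names rather than the functions touched by this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical file descriptor: integer keys stand in for internal keys.
struct FileRange {
  int smallest;
  int largest;
};

// `files` must be sorted and non-overlapping. Returns the half-open index
// range [first, last) of files that overlap the closed key range
// [begin, end]. Both searches are O(lg n).
std::pair<std::size_t, std::size_t> OverlappingFiles(
    const std::vector<FileRange>& files, int begin, int end) {
  assert(begin <= end);
  // First file whose largest key is not before `begin`.
  auto first = std::lower_bound(
      files.begin(), files.end(), begin,
      [](const FileRange& f, int key) { return f.largest < key; });
  // First file at or after `first` whose smallest key is after `end`.
  auto last = std::upper_bound(
      first, files.end(), end,
      [](int key, const FileRange& f) { return key < f.smallest; });
  return std::make_pair(static_cast<std::size_t>(first - files.begin()),
                        static_cast<std::size_t>(last - files.begin()));
}
```

An empty result is reported as `first == last`, so the caller needs no separate "not found" case.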

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987

Differential Revision: D14984192

Pulled By: riversand963

fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
2019-04-17 18:15:20 -07:00
Zhongyi Xie
248b6b551e rename variable to avoid shadowing (#5204)
Summary:
this PR fixes the following compile warning:
```
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204

Differential Revision: D14970160

Pulled By: miasantreble

fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
2019-04-17 10:15:05 -07:00
Zhongyi Xie
baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does the following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Siying Dong
beb44ec3eb WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175)
Summary:
Dummy cache size of 1MB is too large for small block caches. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so 1MB will exceed the size of the whole shard and make the cache exceed the budget.
Change it to 256KB accordingly.
There shouldn't be an obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequent enough.
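The arithmetic can be checked with a small sketch. Only `min_shard_size = 512L * 1024L` comes from the text above; the loop shape, the cap of 6 shard bits, and the even split of capacity across shards are assumptions of this sketch, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed shape of the shard-bit computation: at least 512KB per shard and
// (an assumption of this sketch) at most 6 shard bits.
int DefaultShardBits(std::size_t capacity) {
  const std::size_t min_shard_size = 512L * 1024L;
  std::size_t num_shards = capacity / min_shard_size;
  int bits = 0;
  while (num_shards >>= 1) {
    if (++bits >= 6) return bits;
  }
  return bits;
}

// Assumes capacity is split evenly across 2^bits shards.
std::size_t PerShardCapacity(std::size_t capacity) {
  return capacity >> DefaultShardBits(capacity);
}

// True when a single dummy entry fits inside one shard's budget.
bool DummyEntryFits(std::size_t capacity, std::size_t dummy_size) {
  assert(capacity > 0);
  return dummy_size <= PerShardCapacity(capacity);
}
```

Under these assumptions an 8MB cache gets 16 shards of 512KB each, so a 1MB dummy entry overflows its shard while a 256KB entry does not.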
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175

Differential Revision: D14954289

Pulled By: siying

fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
2019-04-16 12:03:07 -07:00
yiwu-arbug
f1239d5f10 Avoid per-key upper bound check in BlockBasedTableIterator (#5142)
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects that the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at the time it only checked the bound per block. It seems a later change made it a per-key check, which introduces unnecessary key comparisons.

This patch comes with two fixes:

Fix 1: To optimize checking for bounds, we need to compare the bounds with the index key as well. However, BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden in IndexBlockIter.

Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142

Differential Revision: D14907113

Pulled By: siying

fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
2019-04-16 11:37:47 -07:00
Vijay Nadimpalli
71a82a0abe Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
Summary:
Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188

Differential Revision: D14942832

Pulled By: vjnadimpalli

fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
2019-04-15 18:51:04 -07:00
Yi Zhang
3e63e553b4 Fix MultiGet ASSERT bug when passing unsorted result (#5195)
Summary:
Found this when test driving the new MultiGet. If you pass an unsorted result with sorted_result = false you'll trigger the ASSERT incorrectly even though we'll sort down below.

I've also added a simple test covering the sorted_result=true/false scenario, copied from MultiGetSimple.

anand1976
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195

Differential Revision: D14935475

Pulled By: yizhang82

fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
2019-04-15 11:35:21 -07:00
anand76
29111e92b4 Add bounds check in FilePickerMultiGet::PrepareNextLevel() (#5189)
Summary:
Add bounds check when looping through empty levels in FilePickerMultiGet
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5189

Differential Revision: D14925334

Pulled By: anand1976

fbshipit-source-id: 65d53247cf443153e28ce2b8b753fa51c6ae4566
2019-04-12 18:05:09 -07:00
yiwu-arbug
cca141ecf8 Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190)
Summary:
Before using the prefix extractor, `InDomain()` should be checked. None of the uses in memtable.cc checked `InDomain()`.
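A minimal sketch of the check-before-transform pattern, using a hypothetical fixed-length extractor over `std::string` keys. `FixedPrefix` and `PrefixForBloom` are illustrative names, not the classes in memtable.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical fixed-length prefix extractor.
class FixedPrefix {
 public:
  explicit FixedPrefix(std::size_t len) : len_(len) {}

  // Keys shorter than the prefix length are outside the domain.
  bool InDomain(const std::string& key) const { return key.size() >= len_; }

  // Only meaningful for keys that are in the domain.
  std::string Transform(const std::string& key) const {
    assert(InDomain(key));
    return key.substr(0, len_);
  }

 private:
  std::size_t len_;
};

// Fills *prefix and returns true only when the key is in the extractor's
// domain; callers skip the prefix bloom for out-of-domain keys.
bool PrefixForBloom(const FixedPrefix& extractor, const std::string& key,
                    std::string* prefix) {
  if (!extractor.InDomain(key)) return false;
  *prefix = extractor.Transform(key);
  return true;
}
```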
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190

Differential Revision: D14923773

Pulled By: miasantreble

fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
2019-04-12 17:07:49 -07:00
Maysam Yabandeh
fe642cbee6 WritePrepared: fix race condition in reading batch with duplicate keys (#5147)
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actual snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147

Differential Revision: D14758815

Pulled By: maysamyabandeh

fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
2019-04-12 14:40:41 -07:00
Siying Dong
85b2bde3dd Still implement StatisticsImpl::measureTime() (#5181)
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.

Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181

Differential Revision: D14907089

Pulled By: siying

fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
2019-04-12 11:00:35 -07:00
Yanqin Jin
3189398c00 Fix bugs detected by clang analyzer (#5185)
Summary:
as titled. False positive included, fixed anyway to make the check
pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185

Differential Revision: D14909384

Pulled By: riversand963

fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
2019-04-12 10:45:56 -07:00
vijaynadimpalli
f49e12b892 Added missing table properties in log (#5168)
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168

Differential Revision: D14876928

Pulled By: vjnadimpalli

fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
2019-04-11 14:33:49 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch sizes (micros/op)

                       | 1     | 2     | 4     | 8     | 16    | 32

Random pattern (Stride length 0)
Get                    | 4.158 | 4.109 | 4.026 | 4.05  | 4.1   | 4.074
MultiGet (no batching) | 4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075
MultiGet (w/ batching) | 4.461 | 4.256 | 4.277 | 4.11  | 4.182 | 4.14

Good locality (Stride length 16)
Get                    | 4.048 | 3.659 | 3.248 | 2.99  | 2.84  | 2.753
MultiGet (no batching) | 4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
MultiGet (w/ batching) | 4.452 | 3.45  | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
Get                    | 4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
MultiGet (no batching) | 4.406 | 4.005 | 3.644 | 3.49  | 3.381 | 3.268
MultiGet (w/ batching) | 4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
Get                    | 4.012 | 3.922 | 3.768 | 3.61  | 3.582 | 3.555
MultiGet (no batching) | 4.364 | 4.057 | 3.791 | 3.65  | 3.57  | 3.465
MultiGet (w/ batching) | 4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

db_bench command used (on a DB with 4 levels, 12 million keys) -
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong
ed9f5e21aa Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165)
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to run out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165

Differential Revision: D14880709

Pulled By: siying

fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
2019-04-11 10:45:36 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in proactively catching any corruptions that could creep into the DB, as every file is constantly getting re-compacted. And also, of course, it helps to clean up data older than a certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
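
The interaction between the two thresholds can be sketched as a small decision function. This is only an illustration of the paragraph above: `WhyPicked` and `PickReason` are hypothetical names, and the sketch assumes that a threshold of 0 disables the check, that ages and thresholds are in seconds, and that `ttl` is checked before `periodic_compaction_time`.

```cpp
#include <cassert>
#include <cstdint>

enum class PickReason { kNone, kTtl, kPeriodic };

// Assumptions of this sketch: a threshold of 0 disables that check, ages and
// thresholds are in seconds, and ttl is checked before
// periodic_compaction_time.
PickReason WhyPicked(int level, int num_levels, std::uint64_t file_age,
                     std::uint64_t ttl,
                     std::uint64_t periodic_compaction_time) {
  assert(level >= 0 && level < num_levels);
  const bool bottom_level = (level == num_levels - 1);
  // ttl only applies to levels 0 through the last but one.
  if (!bottom_level && ttl > 0 && file_age > ttl) return PickReason::kTtl;
  // periodic_compaction_time applies to every level.
  if (periodic_compaction_time > 0 && file_age > periodic_compaction_time) {
    return PickReason::kPeriodic;
  }
  return PickReason::kNone;
}
```

With `ttl` at 1 day and `periodic_compaction_time` at 7 days, a two-day-old file above the bottom level is picked because of `ttl`, while a bottom-level file is only picked once it is older than 7 days.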
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Siying Dong
0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.

Also, the sharded cache hash function is changed to the new format, because
it falls into this category. It doesn't show visible performance impact
in db_bench results. CPU shown by perf increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash(), has a serious hash quality problem. It can generate a lot of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worth it to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit too. The new hash is slightly
more skewed in this use case, but I manually checked the data and saw that
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00
Yanqin Jin
de00f28132 Refactor ExternalSSTFileTest (#5129)
Summary:
remove an unnecessary function `GenerateAndAddFileIngestBehind`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5129

Differential Revision: D14686710

Pulled By: riversand963

fbshipit-source-id: 5698ae63e10f8ef76c2da753bbb07a36024ac065
2019-04-08 11:16:34 -07:00
Sergei Glushchenko
39c6c5fc1b Expose DB methods to lock and unlock the WAL (#5146)
Summary:
Expose DB methods to lock and unlock the WAL.

These methods are intended for use by MyRocks in order to obtain WAL
coordinates in a consistent way.

The usage scenario is as follows:

MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of the copy. To populate this table MySQL does the following:

1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
   InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
   LSN and checkpoint LSN. MyRocks should report active WAL files names
   and sizes.
5. Release storage engine's locks
6. Unlock binary log

Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.

Currently, RocksDB allows obtaining the list of WAL files. The only missing
bit is a method to lock the writes to WAL files.

LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
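
Steps 3 to 5 of the list above, for a single engine, could look roughly like the sketch below. `FakeDb` is a hypothetical stand-in for a DB handle: its methods return `bool` instead of a status object, and it rejects writes while locked only to keep the model simple. `WalLockGuard` and `CaptureWalCoordinate` are illustrative helpers, not part of this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a DB handle.
class FakeDb {
 public:
  // Writes are buffered; this model rejects them while the WAL is locked.
  bool Write(std::uint64_t bytes) {
    if (locked_) return false;
    buffered_ += bytes;
    return true;
  }
  // Flush first so the size reported from the file system is accurate.
  bool LockWAL() {
    on_disk_ += buffered_;
    buffered_ = 0;
    locked_ = true;
    return true;
  }
  bool UnlockWAL() {
    locked_ = false;
    return true;
  }
  // Models a file system stat call: only flushed bytes are visible.
  std::uint64_t WalSizeOnDisk() const { return on_disk_; }
  bool locked() const { return locked_; }

 private:
  std::uint64_t on_disk_ = 0;
  std::uint64_t buffered_ = 0;
  bool locked_ = false;
};

// RAII guard so the WAL is unlocked on every exit path.
class WalLockGuard {
 public:
  explicit WalLockGuard(FakeDb* db) : db_(db) {
    assert(db_ != nullptr);
    db_->LockWAL();
  }
  ~WalLockGuard() { db_->UnlockWAL(); }
  WalLockGuard(const WalLockGuard&) = delete;
  WalLockGuard& operator=(const WalLockGuard&) = delete;

 private:
  FakeDb* db_;
};

// Lock, read the coordinate, unlock.
std::uint64_t CaptureWalCoordinate(FakeDb* db) {
  WalLockGuard guard(db);
  return db->WalSizeOnDisk();
}
```

The guard keeps the lock window as short as possible, which matters because writers are stalled for as long as the WAL stays locked.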
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146

Differential Revision: D14815447

Pulled By: maysamyabandeh

fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
2019-04-06 06:40:36 -07:00
Adam Simpkins
c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
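
For readers unfamiliar with the annotation, here is a minimal sketch of a printf-style function marked with the GCC/Clang `format` attribute. The macro name `PRINTF_FORMAT` and the function `FormatLog` are hypothetical and are not the names used in RocksDB.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// The attribute is a GCC/Clang extension; other compilers see an empty macro.
#if defined(__GNUC__) || defined(__clang__)
#define PRINTF_FORMAT(fmt_index, first_arg) \
  __attribute__((__format__(__printf__, fmt_index, first_arg)))
#else
#define PRINTF_FORMAT(fmt_index, first_arg)
#endif

// Argument 1 is the format string; variadic arguments start at position 2.
std::string FormatLog(const char* format, ...) PRINTF_FORMAT(1, 2);

std::string FormatLog(const char* format, ...) {
  assert(format != nullptr);
  char buf[256];
  va_list ap;
  va_start(ap, format);
  std::vsnprintf(buf, sizeof(buf), format, ap);  // truncates long messages
  va_end(ap);
  return std::string(buf);
}
```

With the attribute in place, a call such as `FormatLog("%s %s", "only one")` draws a `-Wformat` warning at compile time instead of reading a missing argument at run time.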
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
Maysam Yabandeh
75e8b6dfcf Fix race condition in IteratorWithLocalStatistics (#5149)
Summary:
The ReadCallback was shared between all threads in IteratorWithLocalStatistics. A race condition was hence introduced by recent changes that modify the content of ReadCallback. The patch fixes that by using a separate callback per thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5149

Differential Revision: D14761612

Pulled By: maysamyabandeh

fbshipit-source-id: 814a316aed046c318cb90e22379a6e32ac528949
2019-04-03 16:04:38 -07:00
Zhichao Cao
ebb9b2ed16 Fix the potential DB crash caused by call EndTrace before StartTrace (#5130)
Summary:
Although the user should first call StartTrace to begin RocksDB tracing and call EndTrace to stop it, the user can accidentally call EndTrace first. This causes a segmentation fault and crashes the DB instance. The issue is fixed by checking the pointer first.
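
The shape of the fix can be sketched as follows. `TracedDb` and `Tracer` are hypothetical stand-ins, and `bool` replaces the status object a real API would return.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical tracer; it only remembers where the trace would be written.
struct Tracer {
  std::string path;
};

class TracedDb {
 public:
  bool StartTrace(const std::string& path) {
    assert(!path.empty());
    tracer_.reset(new Tracer{path});
    return true;
  }

  // Checks the pointer first: returns false when no trace is active instead
  // of dereferencing a null tracer.
  bool EndTrace() {
    if (!tracer_) return false;
    tracer_.reset();
    return true;
  }

  bool tracing() const { return tracer_ != nullptr; }

 private:
  std::unique_ptr<Tracer> tracer_;
};
```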

Test case added in db_test2.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130

Differential Revision: D14691420

Pulled By: zhichao-cao

fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8
2019-04-03 13:26:34 -07:00
Zhongyi Xie
e8480d4d9d add assert to silence clang analyzer and fix variable shadowing (#5140)
Summary:
This PR addresses two open issues:

1.  clang analyzer is paranoid about db_ being nullptr after DB::Open calls in the test.
See https://github.com/facebook/rocksdb/pull/5043#discussion_r271394579
Add an assert to keep clang happy
2. PR https://github.com/facebook/rocksdb/pull/5049 introduced variable shadowing:
```
db/db_iterator_test.cc: In constructor ‘rocksdb::DBIteratorWithReadCallbackTest_ReadCallback_Test::TestBody()::TestReadCallback::TestReadCallback(rocksdb::SequenceNumber)’:
db/db_iterator_test.cc:2484:9: error: declaration of ‘max_visible_seq’ shadows a member of 'this' [-Werror=shadow]
         : ReadCallback(max_visible_seq) {}
         ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5140

Differential Revision: D14735497

Pulled By: miasantreble

fbshipit-source-id: 3219ea75cf4ae04f64d889323f6779e84be98144
2019-04-02 21:15:44 -07:00
Maysam Yabandeh
5234fc1b70 Mark logs with prepare in PreReleaseCallback (#5121)
Summary:
In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121

Differential Revision: D14665210

Pulled By: maysamyabandeh

fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
2019-04-02 15:17:47 -07:00
Maysam Yabandeh
14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and makes use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows 0.26% higher throughput in the seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Siying Dong
ebcc8ae1d3 Revert "Avoid per-key upper bound check in BlockBasedTableIterator (#5101)" (#5132)
Summary:
This reverts commit f29dc1b906.

In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101.
Temporarily revert the diff to keep the branch clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132

Differential Revision: D14718584

Pulled By: siying

fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad
2019-04-02 10:00:38 -07:00
Mike Kolupaev
120bc4715b Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043)
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.

In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.

(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043

Differential Revision: D14339233

Pulled By: al13n321

fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
2019-04-01 17:10:40 -07:00
Yi Wu
f29dc1b906 Avoid per-key upper bound check in BlockBasedTableIterator (#5101)
Summary:
`BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at that time it only checked the bound per block. It seems a later change made it a per-key check, which introduces unnecessary key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101

Differential Revision: D14678707

Pulled By: siying

fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e
2019-03-29 13:11:46 -07:00
Yanqin Jin
09957ded1d Update RepeatableThreadTest with MockTimeEnv (#5107)
Summary:
**This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X.**
`RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in the test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies the test and production code paths for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than the current (real) time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing the test to hang. The extra 10 seconds is a best-effort approach because there seems to be no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye out for possible flakiness of this test.
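The underlying behavior can be illustrated with the C++ standard library rather than RocksDB's own `TimedWait`. This is only a sketch: `WaitUntil` and `DeadlineWithMargin` are hypothetical helpers, and `std::condition_variable` stands in for the port-layer condition variable. A timed wait whose deadline is already in the past times out without sleeping, which is why the PR pads the deadline.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Waits until `deadline` or until `done` becomes true, whichever comes first.
// Returns true if `done` was observed, false if the wait timed out.
inline bool WaitUntil(std::chrono::steady_clock::time_point deadline,
                      std::mutex& mu, std::condition_variable& cv,
                      const bool& done) {
  std::unique_lock<std::mutex> lock(mu);
  return cv.wait_until(lock, deadline, [&done] { return done; });
}

// Builds a deadline from the real clock plus a fixed margin, mirroring the
// 10-second padding described in the summary.
inline std::chrono::steady_clock::time_point DeadlineWithMargin(
    std::chrono::seconds margin = std::chrono::seconds(10)) {
  return std::chrono::steady_clock::now() + margin;
}
```

With a deadline one second in the past, `WaitUntil` returns `false` straight away, while `DeadlineWithMargin()` always yields a time point later than the current real time.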
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107

Differential Revision: D14680885

Pulled By: riversand963

fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e
2019-03-29 10:08:50 -07:00
anand76
dae3b5545c Smooth the deletion of WAL files (#5116)
Summary:
WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116

Differential Revision: D14669437

Pulled By: anand1976

fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
2019-03-28 15:17:13 -07:00
Siying Dong
106a94af15 Improve obsolete_files_test (#5125)
Summary:
We see a failure of obsolete_files_test but aren't able to identify
the issue. Improve the test in the following ways and hope we can debug
better next time:
1. Place sync point before automatic compaction runs so race condition
   will always trigger.
2. Disable sync point before test finishes.
3. ASSERT_OK() instead of ASSERT_TRUE(status.ok())
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5125

Differential Revision: D14669456

Pulled By: siying

fbshipit-source-id: dccb7648e334501ad651eb212880096eef1f4ab2
2019-03-28 13:16:02 -07:00
Siying Dong
89ab1381f8 Apply automatic formatting to some files (#5114)
Summary:
The following files were run through the automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h

I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114

Differential Revision: D14633040

Pulled By: siying

fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
2019-03-27 16:24:45 -07:00
Siying Dong
5f6adf3f6a Fix some variable naming in db/transaction_log_impl.* (#5112)
Summary:
We follow Google C++ Style which indicates variable names should be
all underscore: https://google.github.io/styleguide/cppguide.html#Variable_Names
Fix some variable names under db/transaction_log_impl.*
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5112

Differential Revision: D14631157

Pulled By: siying

fbshipit-source-id: 9525c9b0976b843bca377b03897700d87cc60af8
2019-03-27 12:27:54 -07:00
Yi Wu
d69241586e Fix perf_context.user_key_comparison_count for range scan (#5098)
Summary:
Currently `perf_context.user_key_comparison_count` is bumped only in `InternalKeyComparator`. In places where the user comparator is used directly, the counter is not bumped. This patch fixes the majority of them.

Index iterator and filter code also use the user comparator directly and don't bump the counter. That is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098

Differential Revision: D14603753

Pulled By: siying

fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
2019-03-27 10:34:27 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourages
aliases in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might have removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Shi Feng
01e6badbb6 Introduce CPU timers for iterator seek and next (#5076)
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076

Differential Revision: D14525218

Pulled By: fredfsh

fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
2019-03-26 16:32:13 -07:00
Siying Dong
48e7effa79 Avoid to go through every CF for every ReleaseSnapshot() (#5090)
Summary:
With https://github.com/facebook/rocksdb/pull/3009 we go through every CF
to check whether a bottommost compaction needs to be triggered. This is done
within the DB mutex. What we do within the DB mutex may heavily influence the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first checking a global threshold. Most of
the time, the CF loop can be avoided.
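The idea can be sketched as follows. The names (`CfState`, `SnapshotReleaser`, `mark_threshold`) are hypothetical and the bookkeeping is simplified; this is not the actual DBImpl code. A single cached minimum over all column families lets the common case return before touching any of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

struct CfState {
  uint64_t mark_threshold;    // hypothetical: seq above which this CF has work
  bool compaction_scheduled;  // set once the CF has been marked
};

class SnapshotReleaser {
 public:
  explicit SnapshotReleaser(std::vector<CfState> cfs) : cfs_(std::move(cfs)) {
    RecomputeGlobalThreshold();
  }

  // Called with the oldest snapshot sequence still alive after a release.
  void OnRelease(uint64_t oldest_live_seq) {
    if (oldest_live_seq <= global_threshold_) {
      return;  // cheap check: no CF can have work, skip the loop entirely
    }
    ++full_scans_;
    for (CfState& cf : cfs_) {
      if (!cf.compaction_scheduled && cf.mark_threshold < oldest_live_seq) {
        cf.compaction_scheduled = true;
      }
    }
    RecomputeGlobalThreshold();
  }

  int full_scans() const { return full_scans_; }
  const std::vector<CfState>& cfs() const { return cfs_; }

 private:
  // The global threshold is the minimum over CFs that are not yet marked.
  void RecomputeGlobalThreshold() {
    global_threshold_ = std::numeric_limits<uint64_t>::max();
    for (const CfState& cf : cfs_) {
      if (!cf.compaction_scheduled) {
        global_threshold_ = std::min(global_threshold_, cf.mark_threshold);
      }
    }
  }

  std::vector<CfState> cfs_;
  uint64_t global_threshold_ = 0;
  int full_scans_ = 0;
};
```

In this sketch only a release that crosses the cached threshold pays for the per-CF loop; every other call is a single comparison.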
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5090

Differential Revision: D14582684

Pulled By: siying

fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
2019-03-25 19:18:04 -07:00
Maysam Yabandeh
c84fad7a19 Reorder DBIter fields to reduce memory usage (#5078)
Summary:
The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using fewer 64-bit allocations for bools and enums.
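The effect of such a reordering can be seen in a small sketch. The two structs below are hypothetical and only loosely modeled on iterator state; they are not DBIter's real fields. Interleaving 1-byte and 8-byte members forces padding after each small member, whereas grouping the small members lets them share one aligned slot.

```cpp
#include <cassert>
#include <cstdint>

// 1-byte fields interleaved with 8-byte fields: each small field is
// followed by padding so the next uint64_t is properly aligned.
struct Scattered {
  bool valid;
  uint64_t sequence;
  bool is_blob;
  uint64_t num_skipped;
  uint8_t direction;
};

// Same fields, with the 1-byte members grouped at the end so they can
// share a single padded slot.
struct Grouped {
  uint64_t sequence;
  uint64_t num_skipped;
  bool valid;
  bool is_blob;
  uint8_t direction;
};
```

Comparing `sizeof(Scattered)` with `sizeof(Grouped)` on a typical 64-bit target shows the grouped layout is smaller, though exact sizes depend on the platform ABI.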

This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive.

./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078

Differential Revision: D14562676

Pulled By: maysamyabandeh

fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167
2019-03-21 09:55:09 -07:00
Zhongyi Xie
a291f3a1e5 Collect compaction stats by priority and dump to info LOG (#5050)
Summary:
In order to better understand compaction done by thread pools of different priorities, we now collect compaction stats by priority and also print them to the info LOG through stats dump.

```
** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050

Differential Revision: D14408583

Pulled By: miasantreble

fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
2019-03-19 17:28:19 -07:00
Wenjie Yang
36c2a7cfb1 Add an option to filter traces (#5082)
Summary:
Add an option to filter out READ or WRITE operations while tracing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082

Differential Revision: D14515083

Pulled By: mrmiywj

fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610
2019-03-19 14:36:51 -07:00
Hiroaki Nakamura
f2f6acbef3 Add missing C API for transaction (#5077)
Summary:
Partly addresses https://github.com/facebook/rocksdb/issues/4999
I verified `make static_lib` runs fine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5077

Differential Revision: D14521101

Pulled By: maysamyabandeh

fbshipit-source-id: ba88e74a51d2d793cac7260d505b1a54254b53af
2019-03-19 09:43:22 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
anand76
b4fa51dfaf Update bg_error when log flush fails in SwitchMemtable() (#5072)
Summary:
There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order to set the background error, since the WriteBuffer() failure may result in data loss.

Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072

Differential Revision: D14461384

Pulled By: anand1976

fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
2019-03-15 15:19:25 -07:00
Siying Dong
0920bf4e68 Revert "Remove PlainTable's feature store_index_in_file (#4914)" (#5034)
Summary:
This reverts commit ee1818081f.

We are not ready to deprecate this feature. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5034

Differential Revision: D14287246

Pulled By: siying

fbshipit-source-id: e4beafdeaee1c94364fdaa6ba198218d158339f7
2019-03-01 15:45:45 -08:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while the DB is running, but TSAN may report
a data race. We make stats_level_ atomic and access it through accessors.
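A minimal sketch of the pattern, assuming a simplified two-value `Level` enum and a hypothetical `Stats` class rather than the real `Statistics` interface: the field becomes a `std::atomic` and all reads and writes go through accessors, so a concurrent change is no longer a data race.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified stand-in for the real stats level enum.
enum class Level : int { kBasic = 0, kDetailed = 1 };

class Stats {
 public:
  // Writers and readers may run on different threads; relaxed ordering is
  // enough here because the level does not guard any other data.
  void set_level(Level level) {
    level_.store(level, std::memory_order_relaxed);
  }
  Level level() const { return level_.load(std::memory_order_relaxed); }

 private:
  std::atomic<Level> level_{Level::kBasic};
};
```

Callers never touch `level_` directly, which keeps the choice of memory ordering in one place.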
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Maysam Yabandeh
77ebc82b92 Call PreReleaseCallback between WAL and memtable write (#5015)
Summary:
PreReleaseCallback is meant to be called before the writes are visible to the readers. Since the sequence number is known after the WAL write, there is no reason to delay calling PreReleaseCallback until after the memtable write, which would complicate the reader's logic in the presence of our memtable writes that are made visible by the other write thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5015

Differential Revision: D14221670

Pulled By: maysamyabandeh

fbshipit-source-id: a504dd665cf923226d7af09cc8e9c7739a25edc6
2019-02-28 15:49:11 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Maysam Yabandeh
a661c0d208 WritePrepared: optimize read path by avoiding virtual (#5018)
Summary:
The read path includes a callback function, ReadCallback, which eventually calls IsInSnapshot to figure out whether a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function calls to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of the majority of cases. The patch moves those to a non-virtual function in the parent class, ReadCallback, to lower the virtual callback cost.
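The shape of this optimization can be sketched as below. `SnapshotCallback`, `CountingCallback`, and the specific bounds checked are illustrative assumptions, not the actual ReadCallback code: a non-virtual entry point answers the trivial cases inline and falls back to the virtual check only when it has to.

```cpp
#include <cassert>
#include <cstdint>

class SnapshotCallback {
 public:
  SnapshotCallback(uint64_t min_uncommitted, uint64_t snapshot_seq)
      : min_uncommitted_(min_uncommitted), snapshot_seq_(snapshot_seq) {}
  virtual ~SnapshotCallback() = default;

  // Non-virtual fast path: the two cheap comparisons settle most queries.
  bool IsVisible(uint64_t seq) {
    if (seq < min_uncommitted_) return true;  // older than any uncommitted data
    if (seq > snapshot_seq_) return false;    // written after the snapshot
    return IsVisibleFullCheck(seq);           // rare case: pay the virtual call
  }

 protected:
  virtual bool IsVisibleFullCheck(uint64_t seq) = 0;

 private:
  const uint64_t min_uncommitted_;
  const uint64_t snapshot_seq_;
};

// Test double that counts how often the slow path is taken.
class CountingCallback : public SnapshotCallback {
 public:
  using SnapshotCallback::SnapshotCallback;
  int slow_calls = 0;

 protected:
  bool IsVisibleFullCheck(uint64_t seq) override {
    ++slow_calls;
    return seq % 2 == 0;  // placeholder for a real commit lookup
  }
};
```

With bounds of 10 and 20, queries for sequence 5 and 25 never reach the virtual function; only a sequence inside the range does.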
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018

Differential Revision: D14226562

Pulled By: maysamyabandeh

fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
2019-02-26 16:56:19 -08:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of a std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std map from stats name to stats value (stored in a std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
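The data shape described above can be sketched with plain standard containers. `StatsHistory` and its methods are hypothetical and do not reproduce the signature of `GetStatsHistory`; the sketch only shows a timestamp-keyed map of name/value maps with a cap on the number of retained snapshots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// One snapshot: stats name -> stats value, both stored as strings.
using StatsSnapshot = std::map<std::string, std::string>;

class StatsHistory {
 public:
  explicit StatsHistory(std::size_t max_count) : max_count_(max_count) {}

  // Records a snapshot and evicts the oldest ones beyond the cap.
  void Add(uint64_t timestamp_micros, StatsSnapshot snapshot) {
    history_[timestamp_micros] = std::move(snapshot);
    while (history_.size() > max_count_) {
      history_.erase(history_.begin());  // std::map keeps keys sorted
    }
  }

  const std::map<uint64_t, StatsSnapshot>& Get() const { return history_; }

 private:
  std::size_t max_count_;
  std::map<uint64_t, StatsSnapshot> history_;
};
```

Because `std::map` orders keys, the first element is always the oldest snapshot, so eviction needs no extra bookkeeping.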

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Siying Dong
93f7e7a450 Temporarily Disable DBTest2.PresetCompressionDict (#5003)
Summary:
DBTest2.PresetCompressionDict is flaky. Disable it temporarily.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003

Differential Revision: D14139505

Pulled By: siying

fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a
2019-02-19 14:44:12 -08:00
Michael Liu
3c5d1b16b1 Apply modernize-use-override (3)
Summary:
Use C++11’s override and remove virtual where applicable.
Changes are automatically generated.

bypass-lint
drop-conflicts

Reviewed By: igorsugak

Differential Revision: D14131816

fbshipit-source-id: f20e7f7cecf2e699d70f5fa036f72c0e3f59b50e
2019-02-19 13:39:49 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Siying Dong
c2affccc18 Header logger should call LogHeader() (#4980)
Summary:
The info log header feature never worked well, because log level Header was not
translated to Logger::LogHeader() call. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4980

Differential Revision: D14087283

Pulled By: siying

fbshipit-source-id: 7e7d03ce35fa8d13d4ee549f46f7326f7bc0006d
2019-02-15 16:59:36 -08:00
Siying Dong
26a33ee5bd flush_job logs data size too (#4979)
Summary:
Right now when a flush is triggered, the memory consumption is logged but data size is not.
It's useful to log both when we debug unexpectedly small flushed file sizes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4979

Differential Revision: D14071979

Pulled By: siying

fbshipit-source-id: 0cd60449c5205eb00e0fbc299084418f609904ed
2019-02-15 16:33:19 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Changes are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Andrew Kryczka
c8c8104d7e Dictionary compression for files written by SstFileWriter (#4978)
Summary:
If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.

I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978

Differential Revision: D14060763

Pulled By: ajkr

fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
2019-02-14 11:23:55 -08:00
Yanqin Jin
4fc442029a Avoid using kInAtomicGroup tag for single-cf op (#4981)
Summary:
If an operation just involves a single column family, then we do
not have to set the kInAtomicGroup tag when writing to MANIFEST. This change
can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize
kInAtomicGroup tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981

Differential Revision: D14072687

Pulled By: riversand963

fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4
2019-02-13 18:33:42 -08:00
Yanqin Jin
a69d4deefb Atomic ingest (#4895)
Summary:
Make file ingestion atomic.

 as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or none of the ingestions have any effect
on the state of the db.

Also add unit tests for atomic ingestion.

Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895

Differential Revision: D13718245

Pulled By: riversand963

fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
2019-02-12 19:16:17 -08:00
Siying Dong
49ddd7ec4f Stats should be logged in INFO level (#4977)
Summary:
Previously, stats were logged at warning level. This was done because
people reported that they weren't logged in MyRocks. However, we later learned that this
was due to a bug in MyRocks, which is fixed in
79bb705e74

Now we revert the stats logging to INFO level, so that it doesn't pollute the warning
level logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977

Differential Revision: D14058485

Pulled By: siying

fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6
2019-02-12 16:54:55 -08:00
Yanqin Jin
c5a64cffd2 Avoid fsync on the same directory in atomic flush (#4817)
Summary:
In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once
on the same data directory. If two column families share a common directory for
their data, we call fsync only once.
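The deduplication step might look like the following sketch. `DistinctDirs` is a hypothetical helper, and it compares directories by their path strings only, which is an assumption; paths that alias the same directory through different spellings would not be merged.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns each distinct directory once, preserving first-seen order, so the
// caller can issue exactly one fsync per directory.
inline std::vector<std::string> DistinctDirs(
    const std::vector<std::string>& cf_dirs) {
  std::set<std::string> seen;
  std::vector<std::string> result;
  for (const std::string& dir : cf_dirs) {
    if (seen.insert(dir).second) {  // true only the first time we see `dir`
      result.push_back(dir);
    }
  }
  return result;
}
```

Given three column families where two share `/data/a`, the helper returns two entries, so two fsync calls replace three.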
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817

Differential Revision: D13543689

Pulled By: riversand963

fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5
2019-02-12 12:28:36 -08:00
Andrew Kryczka
62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
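The two-state behavior in the list above can be sketched as a small state machine. `BufferingBuilder` is a hypothetical toy, not `BlockBasedTableBuilder`: dictionary training is reduced to a comment and "writing" just appends to a vector, but the transition from buffering to pass-through follows the description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class BufferingBuilder {
 public:
  enum class State { kBuffered, kUnbuffered };

  explicit BufferingBuilder(std::size_t buffer_limit_bytes)
      : limit_(buffer_limit_bytes) {}

  void Add(const std::string& block) {
    if (state_ == State::kBuffered) {
      buffered_.push_back(block);  // hold uncompressed data in memory
      buffered_bytes_ += block.size();
      if (buffered_bytes_ >= limit_) EnterUnbuffered();
    } else {
      WriteOut(block);  // same as before: write as soon as a block is ready
    }
  }

  void Finish() {
    if (state_ == State::kBuffered) EnterUnbuffered();
  }

  State state() const { return state_; }
  const std::vector<std::string>& written() const { return written_; }

 private:
  void EnterUnbuffered() {
    // A real builder would sample buffered_ and train a dictionary here.
    for (const std::string& block : buffered_) WriteOut(block);
    buffered_.clear();
    state_ = State::kUnbuffered;
  }

  void WriteOut(const std::string& block) { written_.push_back(block); }

  std::size_t limit_;
  std::size_t buffered_bytes_ = 0;
  State state_ = State::kBuffered;
  std::vector<std::string> buffered_;
  std::vector<std::string> written_;
};
```

Nothing is written until the buffer limit is reached or `Finish` is called; after that, every block goes straight out.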
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
Maysam Yabandeh
576d2d6c60 WritePrepared: relax assert in compaction iterator (#4969)
Summary:
If IsInSnapshot(seq2, snapshot) determines that the snapshot is released, future queries IsInSnapshot(seq1, snapshot) could still return a definitive answer of true if, for example, seq1 is so old that it is determined to be visible in all snapshots. This violates a recently added assert statement in the compaction iterator. The patch relaxes the assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4969

Differential Revision: D14030998

Pulled By: maysamyabandeh

fbshipit-source-id: 6db53db0e37d0a20e8997ef2c1004b8627614ab9
2019-02-11 15:01:46 -08:00
Yanqin Jin
2d049ab7e8 Checksum properties block for block-based table (#4956)
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956

Differential Revision: D14012741

Pulled By: riversand963

fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
2019-02-11 11:50:01 -08:00
Siying Dong
5d9a623e2c Add a unit test to Ignorable manifest record (#4964)
Summary:
https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manifest
record. Adding a test to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964

Differential Revision: D14012667

Pulled By: siying

fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef
2019-02-11 11:20:24 -08:00
tang-jianfeng
08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
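One simple way to realize "save one per how many requests" is a counter, sketched below. `TraceSampler` is a hypothetical class and not the tracer's actual option or implementation; treating a frequency of 0 as 1 is also an assumption made to avoid division by zero.

```cpp
#include <cassert>
#include <cstdint>

class TraceSampler {
 public:
  // `frequency` means: record one request out of every `frequency`.
  explicit TraceSampler(uint64_t frequency)
      : frequency_(frequency == 0 ? 1 : frequency) {}

  // Returns true when the current request should be written to the trace.
  bool ShouldTrace() { return (count_++ % frequency_) == 0; }

 private:
  uint64_t frequency_;
  uint64_t count_ = 0;
};
```

With a frequency of 3, requests 1, 4, 7, and so on are traced and the rest are skipped.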
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Siying Dong
1a761e6a6c Add a placeholder in manifest indicating ignorable record (#4960)
Summary:
We want to reserve the ability for extra information added to the manifest
in the future to be forward compatible with previous versions. We now create a
placeholder for that. A bit in the tag is added to indicate that a field can be
safely ignored.
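
A sketch of how a reader might use such a bit, under stated assumptions: the mask value, the tag numbers, and the names `kSafeIgnoreMask`, `KnownTag`, and `ClassifyTag` are illustrative and are not taken from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit: any tag with this bit set may be skipped by a reader
// that does not recognize it.
constexpr uint32_t kSafeIgnoreMask = 1u << 13;

// Hypothetical tags that this (older) reader knows how to decode.
enum KnownTag : uint32_t {
  kComparator = 1,
  kLogNumber = 2,
};

enum class TagAction { kHandle, kSkip, kReject };

// Decides what the reader does with a tag found in a manifest record.
inline TagAction ClassifyTag(uint32_t tag) {
  switch (tag) {
    case kComparator:
    case kLogNumber:
      return TagAction::kHandle;  // known field: decode it normally
    default:
      break;
  }
  // Unknown field: skip it only if the writer marked it as ignorable.
  return (tag & kSafeIgnoreMask) != 0 ? TagAction::kSkip : TagAction::kReject;
}
```

An unknown tag without the bit still fails the read, so only fields explicitly marked as ignorable become forward compatible.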
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960

Differential Revision: D14000484

Pulled By: siying

fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6
2019-02-08 11:33:11 -08:00
Siying Dong
f48758e939 Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954)
Summary:
We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
what we expected. We thought that snapshots would always be preserved.
However, we just realized that if no snapshot exists when compaction
starts and a snapshot is created after that, the data seen from the snapshot
can still be dropped by the compaction. This gives the feature a strange
behavior that is hard to explain. As documented in the code
comment, this feature is not very useful with snapshots anyway. The decision
is to deprecate the feature.

We keep the function to avoid breaking users' code. However, we will fail
compactions if false is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954

Differential Revision: D13981900

Pulled By: siying

fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
2019-02-07 16:57:33 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Zhongyi Xie
71cae59a99 exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950)
Summary:
This will fix the following build error:

> db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’:
> db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950

Differential Revision: D13965378

Pulled By: miasantreble

fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56
2019-02-05 17:01:11 -08:00
Zhongyi Xie
00ed41daee Allow copy for PerfContext objects (#4919)
Summary:
The existing implementation of PerfContext does not define a copy constructor or assignment operator, which could potentially cause problems when a user creates copies and resets the built-in one. This PR addresses the issue by providing both with deep copy semantics.
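
A sketch of what deep copy semantics means for a counter object that owns heap-allocated state. `PerfCounters`, `LevelCounters`, and their fields are hypothetical stand-ins, not the real PerfContext layout:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical per-level counters.
struct LevelCounters {
  uint64_t block_cache_hit_count = 0;
};

// Hypothetical stand-in for a perf context that owns an optional per-level map.
class PerfCounters {
 public:
  using LevelMap = std::map<uint32_t, LevelCounters>;

  PerfCounters() = default;

  // Deep copy: the new object gets its own map instead of sharing the source's.
  PerfCounters(const PerfCounters& other)
      : get_count(other.get_count),
        per_level(other.per_level ? new LevelMap(*other.per_level) : nullptr) {}

  PerfCounters& operator=(const PerfCounters& other) {
    if (this != &other) {
      get_count = other.get_count;
      per_level.reset(other.per_level ? new LevelMap(*other.per_level) : nullptr);
    }
    return *this;
  }

  void EnablePerLevel() {
    if (!per_level) per_level.reset(new LevelMap());
  }

  // Clears the counters of this object only.
  void Reset() {
    get_count = 0;
    if (per_level) per_level->clear();
  }

  uint64_t get_count = 0;
  std::unique_ptr<LevelMap> per_level;
};
```

Because each copy owns its own map, resetting the live object leaves a previously taken copy untouched.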
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919

Differential Revision: D13960406

Pulled By: miasantreble

fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334
2019-02-05 14:29:08 -08:00
Jay Zhuang
c9a52cbdc8 Fix potential DB hang while using CompactFiles (#4940)
Summary:
CompactFiles() may block auto compaction, which could cause a DB hang when it
reaches level0_stop_writes_trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940

Differential Revision: D13929648

Pulled By: cooldoger

fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381
2019-02-05 11:23:38 -08:00
Siying Dong
8fe073324f BYTES_READ stats miscount for NotFound cases (#4938)
Summary:
In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes are still increased. The amount increased will be
whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not
increasing these two counters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938

Differential Revision: D13908963

Pulled By: siying

fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162
2019-02-05 10:53:35 -08:00
yangzhijia
31221bb7e8 Properly set upper bound of subcompaction output (#4879) (#4898)
Summary:
Fix the output overlap bug when using subcompactions: the upper bound of the output
file was extended incorrectly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898

Differential Revision: D13736107

Pulled By: ajkr

fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3
2019-02-05 10:20:16 -08:00
Maysam Yabandeh
30468d8eb4 Fix analyze error on possible un-initialized value (#4937)
Summary:
The patch fixes the following analyze error by checking the return status of ParseInternalKey.
```
db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value
    assert(kTypeMerge == orig_ikey.type);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937

Differential Revision: D13908506

Pulled By: maysamyabandeh

fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6
2019-02-01 09:41:27 -08:00
Ming Zhao
59244447e3 Zero seqnum of final key / drop final tombstone when compacting to bottommost level
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927

Differential Revision: D13889458

Pulled By: mzhaom

fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e
2019-02-01 09:21:57 -08:00
Yanqin Jin
842cdc11dd Use correct FileMeta for atomic flush result install (#4932)
Summary:
1. This commit fixes our handling of a combination of two separate edge
cases. If a flush job does not pick any memtable to flush (because another
flush job has already picked the same memtables), and the column family
assigned to the flush job is dropped right before RocksDB calls
rocksdb::InstallMemtableAtomicFlushResults, our original code passes
a FileMetaData object whose file number is 0, failing the assertion in
rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)).
2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932

Differential Revision: D13901322

Pulled By: riversand963

fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9
2019-01-31 14:49:51 -08:00
Maysam Yabandeh
35e5689e11 Take snapshots once for all cf flushes (#4934)
Summary:
FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934

Differential Revision: D13900747

Pulled By: maysamyabandeh

fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c
2019-01-31 12:21:59 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
158da7a6ee Verify checksum before ingestion (#4916)
Summary:
Before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916

Differential Revision: D13863501

Pulled By: riversand963

fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
2019-01-29 17:17:29 -08:00
Sagar Vemuri
4978caaa6f Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925)
Summary:
While stepping through the code I noticed that there is a redundant call to TableFileName.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925

Differential Revision: D13845749

Pulled By: sagar0

fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d
2019-01-28 13:33:23 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify code maintenance, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Siying Dong
bc7d1661a8 Fix test name typo in PlainTableDBTest
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926

Differential Revision: D13830196

Pulled By: siying

fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021
2019-01-25 18:14:26 -08:00
Siying Dong
f184bee77b PlainTable should avoid copying Get() results from immortal source. (#4924)
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortal
(read-only DB, max_open_files=-1) and the file is mmapped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924

Differential Revision: D13827749

Pulled By: siying

fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
2019-01-25 17:12:19 -08:00
Siying Dong
fc53839bfa Disallow customized hash function in DynamicBloom (#4915)
Summary:
I didn't find anywhere that a customized hash function is used in DynamicBloom. Supporting it can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915

Differential Revision: D13794452

Pulled By: siying

fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
2019-01-24 10:34:30 -08:00
Dmitry Fink
e07aa8669d Allow full merge when root of history for a key is reached (#4909)
Summary:
Previously compaction was not collapsing operands for the first
key on a layer, even in cases when it was its root of history. Some
tests (CompactionJobTest.NonAssocMerge) were actually accounting
for that bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909

Differential Revision: D13781169

Pulled By: finik

fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336
2019-01-23 21:46:10 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
PeifengSi
43defe9872 Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902)
Summary:
Even if a key falls in a file's range, we cannot infer that it definitely exists in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902

Differential Revision: D13795018

Pulled By: siying

fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914
2019-01-23 18:00:56 -08:00
Siying Dong
d94aa2f7db Make compaction_pri = kMinOverlappingRatio to be default (#4911)
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcoming of this option. Make it the default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911

Differential Revision: D13789262

Pulled By: siying

fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
2019-01-23 16:47:38 -08:00
Siying Dong
5bf941966b CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907)
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides the best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907

Differential Revision: D13788774

Pulled By: siying

fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
2019-01-23 13:21:01 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Yi Wu
b1ad6ebba8 WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshots. It fixes two scenarios:

Scenario 1:
key1 has two values v1 and v2. There are two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction outputs v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check the last snapshot against the current key's snapshot is when we decide whether to compact out a value because it is hidden by a later value. If in that check we see that the current snapshot is larger than the last snapshot, we know the last snapshot is released, and it is safe to compact out the current key.

Scenario 2:
key1 has two values v1 and v2. There are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction, before we process the key, s1 is released. When compaction processes v2, the snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 becomes s2. When compaction processes v1, the snapshot checker may return kIsInSnapshot (for WritePrepared transactions, this could be because v1 is still in the commit cache). The result becomes inconsistent here.
The fix: remember the set of released snapshots ever reported by the snapshot checker, and ignore them when finding the result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890

Differential Revision: D13705538

Pulled By: maysamyabandeh

fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:24:06 -08:00
Yi Wu
128f532858 WritePrepared: fix issue with snapshot released during compaction (#4858)
Summary:
Compaction iterator keeps a copy of the list of live snapshots at the beginning of compaction, and then queries the snapshot checker to verify whether values of a sequence number are visible to these snapshots. However, when a snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info associated with the snapshot and may report an incorrect result, which leads to values being compacted out when they shouldn't be. This patch conservatively keeps the values if the snapshot checker determines that the snapshot is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858

Differential Revision: D13617146

Pulled By: maysamyabandeh

fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
2019-01-16 09:55:32 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Yi Wu
5d4fddfa52 WritePrepared: Fix visible key compacted out by compaction (#4883)
Summary:
With WritePrepared transactions, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grabs the list of snapshots at its beginning, and only considers those snapshots to dedup keys. Consider the case:
```
seq = 1: put "foo" = "bar"
seq = 2: transaction T: delete "foo", prepare
seq = 3: compaction start
seq = 4: take snapshot S
seq = 5: transaction T: commit.
...
seq = N: compaction iterator reached key "foo".
```
When compaction starts, the list of snapshots is empty. Compaction doesn't take snapshot S into account. When it reaches "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible to any snapshot (which is wrong), and compact the value out.

The fix is to explicitly take a snapshot before compaction grabs the list of snapshots. Compaction will then have to keep keys visible to this snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883

Differential Revision: D13668775

Pulled By: maysamyabandeh

fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
2019-01-15 21:34:38 -08:00
Maysam Yabandeh
cad99a6031 WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886)
Summary:
The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare.
ii) When obtaining the snapshot, if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increases the latest published seq by one above the max.
To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results in an eviction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886

Differential Revision: D13685270

Pulled By: maysamyabandeh

fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
2019-01-15 18:11:52 -08:00
Siying Dong
7d13f307ff Improve Error Message When wal_dir doesn't exist (#4874)
Summary:
Right now the error message when options.wal_dir doesn't exist is not helpful to users. Be more specific.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874

Differential Revision: D13642425

Pulled By: siying

fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5
2019-01-15 16:46:04 -08:00
Yanqin Jin
301da345ae Make a copy of MutableCFOptions to avoid race condition (#4876)
Summary:
If we do not do this, then reading MutableCFOptions may have a race condition
with SetOptions which modifies MutableCFOptions.

Also reserve space in advance for vectors to avoid reallocation changing the
address of its elements.
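
A sketch of why reserving up front matters when code keeps pointers to vector elements. `PointersStayValidWithReserve` is a hypothetical helper, not code from this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends n elements after reserving space for them, recording the address of
// each element right after it is inserted. Returns true if every recorded
// address still points at the same element once all insertions are done.
inline bool PointersStayValidWithReserve(std::size_t n) {
  std::vector<int> values;
  values.reserve(n);  // capacity >= n, so the next n push_backs cannot reallocate

  std::vector<const int*> addresses;
  addresses.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    values.push_back(static_cast<int>(i));
    addresses.push_back(&values.back());
  }

  for (std::size_t i = 0; i < n; ++i) {
    if (addresses[i] != &values[i]) return false;
  }
  return true;
}
```

Without the `reserve` call, a `push_back` that exceeds the current capacity reallocates the storage and invalidates every pointer taken earlier.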

Test plan
```
$make clean && make -j32 all check
$make clean && COMPILE_WITH_TSAN=1 make -j32 all check
$make clean && COMPILE_WITH_ASAN=1 make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876

Differential Revision: D13644500

Pulled By: riversand963

fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47
2019-01-11 17:43:37 -08:00
Maysam Yabandeh
d56ac22b44 Remove duplicates from SnapshotList::GetAll (#4860)
Summary:
The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entries are of no use and could be safely ignored. Moreover, not having duplicate entries simplifies reasoning in the compaction_iterator.cc code. For example, when searching for the previous_snap we currently use the snapshot before the current one, but the code that uses it also expects it to be less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list.
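
A sketch of the deduplication idea on a plain vector of sequence numbers. `UniqueSnapshots` is a hypothetical helper, and it assumes the input is already in ascending order so that equal values are adjacent:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Drops repeated sequence numbers from an ascending list of snapshots, so that
// each entry is strictly greater than the one before it.
inline std::vector<SequenceNumber> UniqueSnapshots(
    std::vector<SequenceNumber> seqs) {
  seqs.erase(std::unique(seqs.begin(), seqs.end()), seqs.end());
  return seqs;
}
```

After deduplication, "the snapshot before the current one" is always strictly smaller than the current snapshot, which is the property the summary says the compaction code relies on.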
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860

Differential Revision: D13615502

Pulled By: maysamyabandeh

fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b
2019-01-09 16:25:42 -08:00
Siying Dong
8641e9adf7 Non-initial file preloading should always prefetch index and filter (#4852)
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload the index and filter in the non-initial file loading case. This is a little too
complicated to understand. We observed one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They are expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852

Differential Revision: D13595402

Pulled By: siying

fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
2019-01-08 12:47:34 -08:00
tom wang
42135523a0 modify comments about flush_queue_
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4850

Differential Revision: D13591940

Pulled By: sagar0

fbshipit-source-id: 617794e0a41d0f4554d40871180b061e84189fc5
2019-01-07 13:52:59 -08:00
Yi Wu
cf852fdf55 Minor fix: single delete a blob value is not a mismatch (#4848)
Summary:
In compaction iterator, if the next value of a single delete is a blob value, it should not be treated as a mismatch. This is only a minor fix and doesn't affect correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848

Differential Revision: D13585812

Pulled By: yiwu-arbug

fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0
2019-01-04 16:31:02 -08:00
Andrew Kryczka
9e2c804fe6 Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:

- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.

The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829

Differential Revision: D13561355

Pulled By: ajkr

fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
2019-01-04 11:24:08 -08:00
Yanqin Jin
a07175af65 Refactor atomic flush result installation to MANIFEST (#4791)
Summary:
As titled.
Since different bg flush threads can flush different sets of column families
(due to column family creation and drop), we decide not to let one thread
perform atomic flush result installation for other threads. Bg flush threads
will install their atomic flush results sequentially to MANIFEST, using
a condition variable, i.e. atomic_flush_install_cv_, to coordinate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791

Differential Revision: D13498930

Pulled By: riversand963

fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
2019-01-03 20:56:24 -08:00
Yi Wu
77a8d4d476 Detect if Jemalloc is linked with the binary (#4844)
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detects jemalloc, but we assume the main program uses jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844

Differential Revision: D13574934

Pulled By: yiwu-arbug

fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
2019-01-03 16:30:12 -08:00
DorianZheng
8c79f79208 Fix skip WAL for whole write_group when leader's callback fail (#4838)
Summary:
The original implementation has two problems:

1. f0dda35d7d/db/db_impl_write.cc (L478)
f0dda35d7d/db/write_thread.h (L231)

If the callback status of the leader of the write_group fails, then the whole write_group will not write to the WAL, which may cause data loss.

2. f0dda35d7d/db/write_thread.h (L130)
The annotation says that Writer.status is the status of the memtable inserter, but the original implementation uses it for another case, which is not consistent with the original design. It looks like we can still reuse Writer.status, but we should modify the annotation, so that Writer.status is not only the status of the memtable inserter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838

Differential Revision: D13574070

Pulled By: yiwu-arbug

fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b
2019-01-03 12:40:42 -08:00
Siying Dong
e4feb78606 Try to fix DBSSTTest.RateLimitedDelete flakiness (#4840)
Summary:
DBSSTTest.RateLimitedDelete is flaky. The root cause is not completely identified, but
the compaction waiting in the test doesn't strictly wait for compaction cleanup to finish, which
may cause test flakiness. Fix it first and see whether the failures still happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840

Differential Revision: D13567273

Pulled By: siying

fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a
2019-01-03 11:05:19 -08:00
Andrew Kryczka
ace543a815 fix accounting for range tombstones in TableProperties (#4841)
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841

Differential Revision: D13568424

Pulled By: ajkr

fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
2019-01-02 15:08:53 -08:00
Anand Ananthabhotla
b9d6eccac1 Lock free MultiGet (#4754)
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.

Tests:
1. Unit tests
2. make check
3. db_bench -

TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)

Multireadrandom with batch size 1:
multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754

Differential Revision: D13363550

Pulled By: anand1976

fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
2019-01-02 11:42:54 -08:00
Faustin Lammler
7d65bd5ce4 Fix spelling errors (#4827)
Summary:
Hi, Lintian, the Debian package checker, complains about spelling errors (spelling-error-in-binary).

See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827

Differential Revision: D13566362

Pulled By: riversand963

fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d
2019-01-02 11:17:57 -08:00
Yanqin Jin
ec68091d19 Remove an unused parameter (#4816)
Summary:
The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is
not used. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816

Differential Revision: D13543218

Pulled By: riversand963

fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153
2019-01-02 09:59:13 -08:00
Siying Dong
f0dda35d7d Preload some files even if options.max_open_files (#3340)
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the performance gap between options.max_open_files being -1 and being a large number. To avoid a significant regression in DB reopen speed if options.max_open_files != -1, limit the files to preload at DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340

Differential Revision: D6686945

Pulled By: siying

fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
2018-12-28 18:02:28 -08:00
Burton Li
46e3209e0d Compaction limiter miscs (#4795)
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() isn't synchronized with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795

Differential Revision: D13542183

Pulled By: siying

fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
2018-12-26 13:59:35 -08:00
Alexander Zinoviev
80bf8975fd Add a new per level counter for block cache hit (#4796)
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796

Differential Revision: D13513688

Pulled By: zinoale

fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
2018-12-21 13:20:05 -08:00
Andrew Kryczka
e0be1bc4f1 fix DeleteRange memory leak for mmap and block cache (#4810)
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.

Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810

Differential Revision: D13534296

Pulled By: ajkr

fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
2018-12-20 21:59:49 -08:00
Siying Dong
da1c64b6e7 Introduce a CPU time counter in perf_context (#4741)
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer take an Env as an argument, and sometimes pass it in. The immediate reason is to let the unit tests use SpecialEnv, where we can inject logic. But in the long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
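
As a rough illustration of the mechanism named in the summary above, a per-thread CPU clock can be read with `clock_gettime()` and `CLOCK_THREAD_CPUTIME_ID`. This is a minimal sketch, not the RocksDB implementation; the helper names `ThreadCpuNanos` and `MeasureCpuNanos` are hypothetical.

```cpp
#include <cstdint>
#include <time.h>

// Hypothetical helper (not RocksDB code): returns the CPU time consumed by
// the calling thread in nanoseconds, or 0 if the clock is unavailable.
inline uint64_t ThreadCpuNanos() {
#if defined(CLOCK_THREAD_CPUTIME_ID)
  struct timespec ts;
  if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0) {
    return static_cast<uint64_t>(ts.tv_sec) * 1000000000ULL +
           static_cast<uint64_t>(ts.tv_nsec);
  }
#endif
  return 0;
}

// Hypothetical helper: thread CPU time spent inside `fn`, in nanoseconds.
template <typename Fn>
uint64_t MeasureCpuNanos(Fn fn) {
  const uint64_t start = ThreadCpuNanos();
  fn();
  return ThreadCpuNanos() - start;
}
```

Unlike a wall-clock timer, this counter does not advance while the thread is blocked or descheduled, and its resolution depends on the platform.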

Differential Revision: D13287798

Pulled By: siying

fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
2018-12-20 12:03:44 -08:00
Abhishek Madan
02bfc5831e Change is_range_del_table_empty_ flag to atomic (#4801)
Summary:
To avoid a race on the flag, make it an atomic_bool. This
doesn't seem to significantly affect benchmarks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4801

Differential Revision: D13523845

Pulled By: abhimadan

fbshipit-source-id: 3bc29f53c50a4e06cd9f8c6232a4bb221868e055
2018-12-19 17:21:14 -08:00
Abhishek Madan
8bf73208a4 Remove stale TODO (#4800)
Summary:
This TODO was already addressed, but I forgot to remove it
before landing the PR it came from.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4800

Differential Revision: D13522284

Pulled By: abhimadan

fbshipit-source-id: 7766bc4f5b54e47d355cf26137ef5e86c604472a
2018-12-19 15:45:37 -08:00
Jakub Tomanik
71a69d9b68 Fix building RocksDB for iOS (#4687)
Summary:
This PR contains the following fixes:

1. Fixing Makefile to support non-default locations of developer tools

2. Fixing compile error using a patch from https://github.com/facebook/rocksdb/pull/4007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4687

Differential Revision: D13287263

Pulled By: riversand963

fbshipit-source-id: 4525eb42ba7b6f82af5f9bfb8e52fa4024e27ccc
2018-12-19 14:13:55 -08:00
Adam Retter
1b0c9ce396 Fix Windows broken build error due to non-const override (#4798)
Summary:
1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`.
The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows.

2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798

Differential Revision: D13519734

Pulled By: sagar0

fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459
2018-12-19 13:29:51 -08:00
Yanqin Jin
671a7eb36f Avoid switching empty memtable in certain cases (#4792)
Summary:
In certain cases, we do not perform memtable switching if the active
memtable of the column family is empty. Two exceptions:
1. In manual flush, if cached_recoverable_state_empty_ is false, then we need
   to switch memtable due to requirement of transaction.
2. In switch WAL, we need to switch memtable anyway because we have to seal the
   memtable if the WAL on which it depends will be closed.

This change can potentially delay the occurrence of write stalls because the
number of memtables increases more slowly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792

Differential Revision: D13499501

Pulled By: riversand963

fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1
2018-12-18 16:47:23 -08:00
Abhishek Madan
c15df15f07 Fix unused member compile error
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4793

Differential Revision: D13509363

Pulled By: abhimadan

fbshipit-source-id: 530b4765e3335d6ecd016bfaa89645f8aa98c61f
2018-12-18 14:28:42 -08:00
Abhishek Madan
81b6b09f6b Remove v1 RangeDelAggregator (#4778)
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778

Differential Revision: D13495930

Pulled By: abhimadan

fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
2018-12-17 17:33:46 -08:00
Roman Zeyde
a62c6626e0 Support setting options on column families via C bindings (#4785)
Summary:
Currently, it supports setting options only on the default column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4785

Differential Revision: D13491819

Pulled By: ajkr

fbshipit-source-id: 75c78bd86222bb05568e538562af84fb53eb4d8d
2018-12-17 13:52:12 -08:00
Abhishek Madan
abf931afa6 Add compaction logic to RangeDelAggregatorV2 (#4758)
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758

Differential Revision: D13439254

Pulled By: abhimadan

fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
2018-12-17 13:20:51 -08:00
Maysam Yabandeh
4ed3c1eb88 Fix flaky test DeleteFileRange (#4784)
Summary:
The test fails sporadically: it expects the DB to be empty after the DeleteFilesInRange(..., nullptr, nullptr) call, but it is not. Debugging shows cases where files are skipped because they are being compacted. The patch fixes the test by waiting for the last CompactRange to finish before calling DeleteFilesInRange.
Verified by
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=10000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4784

Differential Revision: D13469402

Pulled By: maysamyabandeh

fbshipit-source-id: 3d8f44abe205b82c69f01e7edf27e1f8098248e1
2018-12-14 13:47:36 -08:00
Maysam Yabandeh
349542332a Fix race condition on options_file_number_ (#4780)
Summary:
options_file_number_ must be written under db::mutex_ since its read is protected by mutex_ in ::GetLiveFiles(). However, it is currently written in ::RenameTempFileToOptionsFile(), which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by acquiring mutex_ before writing options_file_number_, and it does so only if the rename of the options file is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780
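
The pattern described above (take the mutex only for the write, and only after the rename succeeded) can be sketched in isolation. This is a toy stand-in, not the DBImpl code; `OptionsFileTracker`, `FinishOptionsFile`, and the `rename_ok` parameter are hypothetical names.

```cpp
#include <cstdint>
#include <mutex>

// Toy stand-in for the DB state; all names here are hypothetical.
class OptionsFileTracker {
 public:
  // Must be called WITHOUT holding mutex_. `rename_ok` stands in for the
  // result of renaming the temporary options file.
  bool FinishOptionsFile(bool rename_ok, uint64_t new_number) {
    if (!rename_ok) {
      return false;  // leave the recorded number untouched on failure
    }
    std::lock_guard<std::mutex> guard(mutex_);
    options_file_number_ = new_number;  // write protected by the mutex
    return true;
  }

  // Readers take the same mutex, so they never race with the write above.
  uint64_t options_file_number() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return options_file_number_;
  }

 private:
  mutable std::mutex mutex_;
  uint64_t options_file_number_ = 0;
};
```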

Differential Revision: D13461411

Pulled By: maysamyabandeh

fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b
2018-12-13 19:27:38 -08:00
Yanqin Jin
4fce44fc8b Improve flushing multiple column families (#4708)
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708

Differential Revision: D13378954

Pulled By: riversand963

fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
2018-12-13 15:12:40 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Burton Li
a8b9891f95 Concurrent task limiter for compaction thread control (#4332)
Summary:
This PR aims to resolve the following issue:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918

We have a RocksDB instance using leveled compaction with multiple column families (CFs). Some CFs use HDD to store large, less frequently accessed data, and the others use SSD.
When there is continuous write traffic to all CFs, the compaction thread pool is mostly occupied by the slow HDD compactions, which prevents full utilization of the SSD bandwidth.
Since atomic writes and transactions are needed across CFs, splitting the data into multiple RocksDB instances is not an option for us.

With the compaction thread control, we got a 30%+ HDD write throughput gain, and also much smoother SSD writes since fewer write stalls happen.

ConcurrentTaskLimiter can be shared by multiple CFs across RocksDB instances, so the feature works not only for multi-CF scenarios but also for multi-instance scenarios that need per-tenant disk I/O resource control.

The usage is straightforward, e.g.:

//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);

// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;

// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;

...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
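
To make the limiter semantics in the usage notes above concrete (a negative limit bypasses the limiter, 0 admits no tasks), here is a simplified, self-contained sketch. It is not the ConcurrentTaskLimiterImpl source; `SimpleTaskLimiter`, `TryAcquire`, and `Release` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Simplified, hypothetical limiter; not the ConcurrentTaskLimiterImpl source.
class SimpleTaskLimiter {
 public:
  explicit SimpleTaskLimiter(int32_t max_outstanding)
      : max_outstanding_(max_outstanding), outstanding_(0) {}

  // A negative limit disables the limiter; 0 admits no new tasks.
  void SetMaxOutstandingTask(int32_t limit) { max_outstanding_.store(limit); }

  // Tries to reserve a slot. Returns false when the limit is reached.
  bool TryAcquire() {
    int32_t current = outstanding_.load();
    for (;;) {
      const int32_t limit = max_outstanding_.load();
      if (limit >= 0 && current >= limit) {
        return false;
      }
      if (outstanding_.compare_exchange_weak(current, current + 1)) {
        return true;
      }
      // compare_exchange_weak refreshed `current`; re-check and retry.
    }
  }

  // Gives back a slot previously reserved with TryAcquire().
  void Release() { outstanding_.fetch_sub(1); }

  int32_t outstanding() const { return outstanding_.load(); }

 private:
  std::atomic<int32_t> max_outstanding_;
  std::atomic<int32_t> outstanding_;
};
```

In this sketch, a compaction job for a column family would call `TryAcquire()` before starting and `Release()` when done, so slow HDD compactions sharing one limiter cannot occupy more than their configured share of the thread pool.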

Differential Revision: D13226590

Pulled By: siying

fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 13:18:28 -08:00
Maysam Yabandeh
0aa17c1002 Fix flaky test DBCompactionTest::DeleteFileRange (#4776)
Summary:
The test has been failing sporadically probably because the configured compaction options were actually unused. Verified that by the following:
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=1000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4776

Differential Revision: D13441052

Pulled By: maysamyabandeh

fbshipit-source-id: d35075b9e6cef9b9c9d0d571f9cd72ade8eda55d
2018-12-12 16:32:14 -08:00
DorianZheng
4862720e08 Expose column family id to FlushJobInfo
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772

Differential Revision: D13428923

Pulled By: ajkr

fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a
2018-12-11 20:33:42 -08:00
Abhishek Madan
cad248f5c6 Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740)
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.

In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740

Differential Revision: D13287382

Pulled By: abhimadan

fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
2018-12-11 12:10:48 -08:00
Sagar Vemuri
dde3ef1116 Change directory where ExternalSSTFileBasicTest runs (#4766)
Summary:
Change the directory where ExternalSSTFileBasicTest* tests run.

**Problem:**
Without this change, I spent considerable time chasing around a non-existent issue as ExternalSSTFileTest.* and ExternalSSTFileBasicTest.* create similar directories.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4766

Differential Revision: D13409384

Pulled By: sagar0

fbshipit-source-id: c33e1f4d505dfa6efbc788d6c57cdb680053ded3
2018-12-11 10:21:37 -08:00
Abhishek Madan
64aabc9183 Properly set smallest key of subcompaction output (#4723)
Summary:
It is possible to see a situation like the following when
subcompactions are enabled:
1. A subcompaction boundary is set to `[b, e)`.
2. The first output file in a subcompaction has `c@20` as its smallest key
3. The range tombstone `[a, d)30` is encountered.
4. The tombstone is written to the range-del meta block and the new
   smallest key is set to `b@0` (since no keys in this subcompaction's
   output can be smaller than `b`).
5. A key `b@10` in a lower level will now reappear, since it is not
   covered by the truncated start key `b@0`.

In general, unless the smallest data key in a file has a seqnum of 0, it
is not safe to truncate a tombstone at the start key to have a seqnum of
0, since it can expose keys with a seqnum greater than 0 but less than
the tombstone's actual seqnum.

To fix this, when the lower bound of a file is from the subcompaction
boundaries, we now set the seqnum of an artificially extended smallest
key to the tombstone's seqnum. This is safe because subcompactions
operate over disjoint sets of keys, and the subcompactions that can
experience this problem are not the first subcompaction (which is
unbounded on the left).

Furthermore, there is now an assertion to detect the described anomalous
case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4723
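
The anomaly above follows from how internal keys are ordered: by user key ascending, then by sequence number descending. The sketch below uses a simplified, hypothetical `InternalKey` struct (not RocksDB's encoded internal key) to show why `b@10` falls outside a file whose smallest key is `b@0`, but inside one whose smallest key is `b@30`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical simplified internal key: user key plus sequence number.
struct InternalKey {
  std::string user_key;
  uint64_t seqnum;
};

// Orders by user key ascending, then by sequence number descending, so newer
// entries for the same user key sort first.
inline bool InternalKeyLess(const InternalKey& a, const InternalKey& b) {
  if (a.user_key != b.user_key) {
    return a.user_key < b.user_key;
  }
  return a.seqnum > b.seqnum;
}

// True if `key` sorts at or after the file's smallest key, i.e. the file's
// key range still includes it at the lower bound.
inline bool AtOrAfterSmallest(const InternalKey& smallest,
                              const InternalKey& key) {
  return !InternalKeyLess(key, smallest);
}
```

With smallest key `b@0`, the key `b@10` sorts before the file's lower bound, so a tombstone truncated to that bound no longer covers it. Using the tombstone's seqnum (`b@30`) as the bound keeps `b@10` inside the range.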

Differential Revision: D13236188

Pulled By: abhimadan

fbshipit-source-id: a6da6a113f2de1e2ff307ca72e055300c8fe5692
2018-12-10 12:38:31 -08:00
Yanqin Jin
f307479ba6 Enable checkpoint of read-only db (#4681)
Summary:
1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it
   should call DBImpl::GetLiveFiles(flush_memtable=false).
2. In DBImpl::Recover, we should also recover the OPTIONS file name and/or
   number so that an immediate subsequent GetLiveFiles will get the correct
   OPTIONS name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681

Differential Revision: D13069205

Pulled By: riversand963

fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a
2018-12-07 17:06:02 -08:00
Yanqin Jin
9be3e6b488 Allow file-ingest-triggered flush to skip waiting for write-stall clear (#4751)
Summary:
When a write stall has already been triggered because the number of L0 files
reached the threshold, file ingestion must proceed with its flush without
waiting for the write stall condition to be cleared by compaction, because
compaction can wait for ingestion to finish (circular wait).

In order to avoid this wait, we can set `FlushOptions.allow_write_stall` to be
true (default is false). Setting it to false can cause deadlock.

This can happen when the number of compaction threads is low.

Consider the following:
```
Time  compaction_thread                        ingestion_thread
 |                                             num_running_ingest_file_++
 |    while(num_running_ingest_file_>0){wait}
 |                                             flush
 V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4751

Differential Revision: D13343037

Pulled By: riversand963

fbshipit-source-id: d3b95938814af46ec4c463feff0b50c70bd8b23f
2018-12-05 14:59:29 -08:00
Yanqin Jin
b96fccb1e6 Move a function to critical section (#4752)
Summary:
Test plan
```
$make clean && make -j32 all check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4752

Differential Revision: D13344705

Pulled By: riversand963

fbshipit-source-id: fc3a43174d09d70ccc2b09decd78e1da1b6ba9d1
2018-12-05 13:12:09 -08:00
Zhongyi Xie
2f1ca4e838 Revert "BaseDeltaIterator: always check valid() before accessing key(… (#4744)
Summary:
…) (#4702)"

This reverts commit 3a18bb3e15.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4744

Differential Revision: D13311869

Pulled By: miasantreble

fbshipit-source-id: 6300b12cc34828d8b9274e907a3aef1506d5d553
2018-12-03 23:38:27 -08:00
Zhongyi Xie
3a18bb3e15 BaseDeltaIterator: always check valid() before accessing key() (#4702)
Summary:
The current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either the base iterator or the delta iterator. Calling key() in such a scenario leads to assertion failures and runtime errors.
This PR addresses the bug by checking for valid keys before calling `IsOverUpperBound()`, and adds test coverage for iterate_upper_bound usage in BaseDeltaIterator.
It also recommits https://github.com/facebook/rocksdb/pull/4656 (which was reverted earlier due to bugs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702

Differential Revision: D13146643

Pulled By: miasantreble

fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0
2018-11-30 15:35:13 -08:00
Siying Dong
6e938c904f Make NewBloomFilterPolicy() use full filter by default (#4735)
Summary:
The full (use_block_based_builder=false) Bloom filter has clear CPU savings, with the limitation that building an SST file uses temporary memory proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction, so we now change the default to the better-performing option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735

Differential Revision: D13266674

Pulled By: siying

fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602
2018-11-30 13:13:27 -08:00
Sagar Vemuri
70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found the `FIFOCompactionPicker` class and its impl methods interspersed between `LevelCompactionPicker` methods, which made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Yanqin Jin
8d7bc76f36 Fix a flaky test DBFlushTest.SyncFail (#4633)
Summary:
There is a race condition in DBFlushTest.SyncFail, as illustrated below.
```
time         thread1                             bg_flush_thread
  |     Flush(wait=false, cfd)
  |     refs_before=cfd->current()->TEST_refs()   PickMemtable calls cfd->current()->Ref()
  V
```
The race condition between thread1 getting the ref count of cfd's current
version and bg_flush_thread incrementing the cfd's current version makes it
possible for later assertion on refs_before to fail. Therefore, we add test
sync points to enforce the order and assert on the ref count before and after
PickMemtable is called in bg_flush_thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633

Differential Revision: D12967131

Pulled By: riversand963

fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b
2018-11-29 13:39:56 -08:00
Kefu Chai
7dbee38716 db/repair: reset Repair::db_lock_ in ctor (#4683)
Summary:
There is a chance that

* the caller tries to repair the db when holding the db_lock, in
  that case the env implementation might not set the `lock`
  parameter of Repairer::Run().
* the caller somehow never calls Repairer::Run().

Either way, the destructor of Repair will compare the uninitialized
db_lock_ with nullptr and try to unlock it. There is a good chance
that db_lock_ is not nullptr, then boom.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4683

Differential Revision: D13260287

Pulled By: riversand963

fbshipit-source-id: 878a119d2e9f10a0fa17ee62cf3fb24b33d49fa5
2018-11-29 11:26:41 -08:00
Abhishek Madan
8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Zhichao Cao
7125e24619 Add the max trace file size limitation option to Tracing (#4610)
Summary:
If the user does not end the trace manually, tracing will continue, which can potentially use up all the storage space and cause problems. In this PR, a max trace file size is added to TraceOptions; users can set the value if needed, and the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610

Differential Revision: D12893400

Pulled By: zhichao-cao

fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
2018-11-27 14:27:05 -08:00
Abhishek Madan
85394a96ca Speed up range scans with range tombstones (#4677)
Summary:
Previously, every range tombstone iterator was seeked on every
ShouldDelete call, which quickly degraded performance for long range
scans. This PR improves performance by tracking iterator positions and
only advancing iterators when necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677
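
The idea of tracking an iterator position instead of seeking on every call can be sketched with a forward-only cursor. This is a simplified illustration that ignores sequence numbers and assumes keys are queried in ascending order; `Fragment` and `ForwardCoverageChecker` are hypothetical names, not RocksDB classes.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fragment: a half-open user-key range [start, end).
struct Fragment {
  std::string start;
  std::string end;
};

// Answers "is this key covered?" for keys supplied in ascending order.
// `fragments` must be sorted and non-overlapping.
class ForwardCoverageChecker {
 public:
  explicit ForwardCoverageChecker(std::vector<Fragment> fragments)
      : fragments_(std::move(fragments)), pos_(0) {}

  bool Covered(const std::string& key) {
    // Skip fragments that end at or before the key; they cannot cover this
    // key or any later one, so the cursor never moves backwards.
    while (pos_ < fragments_.size() && fragments_[pos_].end <= key) {
      ++pos_;
    }
    return pos_ < fragments_.size() && fragments_[pos_].start <= key;
  }

  size_t position() const { return pos_; }

 private:
  std::vector<Fragment> fragments_;
  size_t pos_;
};
```

Over a long scan, the cursor advances at most once per fragment in total, rather than paying for a fresh seek on every key.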

Differential Revision: D13205373

Pulled By: abhimadan

fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb
2018-11-26 16:33:41 -08:00
Zhongyi Xie
a21cb22ee3 Revert "apply ReadOptions.iterate_upper_bound to transaction iterator… (#4705)
Summary:
… (#4656)"

This reverts commit b76398a82b.

Will add test coverage for iterate_upper_bound before re-commit b76398
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4705

Differential Revision: D13148592

Pulled By: miasantreble

fbshipit-source-id: 4d1ce0bfd9f7a5359a7688bd780eb06a66f45b1f
2018-11-24 10:46:28 -08:00
Andrew Kryczka
07cf0ee589 Fix ticker stat for number files closed (#4703)
Summary:
We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use...

Blame: 63f216ee0a

Closes #4700.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703

Differential Revision: D13146769

Pulled By: ajkr

fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008
2018-11-21 18:31:34 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Abhishek Madan
457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Abhishek Madan
ed5aec5ba3 Fix range tombstone covering short-circuit logic (#4698)
Summary:
Since a range tombstone seen at one level will cover all keys
in the range at lower levels, there was a short-circuiting check in Get
that reported a key was not found at most one file after the range
tombstone was discovered. However, this was incorrect for merge
operands, since a deletion might only cover some merge operands,
which implies that the key should be found. This PR fixes this logic in
the Version portion of Get, and removes the logic from the MemTable
portion of Get, since the performance benefit provided there is minimal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698

Differential Revision: D13142484

Pulled By: abhimadan

fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
2018-11-20 13:29:22 -08:00
Siying Dong
13579e8c5a WriteBufferManager doesn't cost to cache if no limit is set (#4695)
Summary:
WriteBufferManager is not invoked when allocating memory for memtable if the limit is not set, even if a cache is passed. This is inconsistent with what the comment says. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695

Differential Revision: D13112722

Pulled By: siying

fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
2018-11-18 16:55:43 -08:00
Andrew Kryczka
9d6d4867ab Fix uninitialized fields in file metadata (#4693)
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693

Differential Revision: D13113189

Pulled By: ajkr

fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
2018-11-16 20:49:17 -08:00
Yanqin Jin
147697420a Rollback memtable flush upon atomic flush fail (#4641)
Summary:
This fixes an assertion.

An atomic flush can have multiple flush jobs. Some of them may fail. If any of
them fails, we need to rollback all of them.
For the flush jobs that do fail, we already call `RollbackMemTableFlush` in
`FlushJob::Run`. The tricky part is for flush jobs that have completed
successfully. We need to call `RollbackMemTableFlush` for them as well.

The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort
without the corresponding change in AtomicFlushMemTablesToOutputFiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641
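
The all-or-nothing rule described above can be sketched with toy jobs: a failed job rolls itself back, and the caller then rolls back the jobs that succeeded. This is only an illustration of the control flow; `ToyFlushJob` and `RunAtomically` are hypothetical and do not mirror the signatures of `FlushJob` or `AtomicFlushMemTablesToOutputFiles`.

```cpp
#include <vector>

// Hypothetical flush job: `will_succeed` stands in for the outcome of Run().
struct ToyFlushJob {
  explicit ToyFlushJob(bool ok) : will_succeed(ok) {}

  bool Run() {
    flushed = will_succeed;
    if (!will_succeed) {
      rolled_back = true;  // a failed job rolls itself back
    }
    return will_succeed;
  }

  void Rollback() {
    flushed = false;
    rolled_back = true;
  }

  bool will_succeed;
  bool flushed = false;
  bool rolled_back = false;
};

// Runs every job; if any job fails, also rolls back the ones that succeeded,
// so that either all jobs take effect or none do.
inline bool RunAtomically(std::vector<ToyFlushJob>& jobs) {
  bool all_ok = true;
  for (auto& job : jobs) {
    if (!job.Run()) {
      all_ok = false;
    }
  }
  if (!all_ok) {
    for (auto& job : jobs) {
      if (job.flushed) {
        job.Rollback();
      }
    }
  }
  return all_ok;
}
```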

Differential Revision: D12943649

Pulled By: riversand963

fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868
2018-11-14 20:54:17 -08:00
Abhishek Madan
6bee36a786 Modify FragmentedRangeTombstoneList member layout (#4632)
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:

```
tombstones_:     [a, b) [c, e) [h, k)
                   | \   /  \   /  |
                   |  \ /    \ /   |
                   v   v      v    v
tombstone_seqs_: [ 5 3 10 7 2 8 6  ]
```

This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632
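
A minimal sketch of the two-vector layout, under stated assumptions: the field names are illustrative, each fragment's sequence numbers are assumed to be stored in descending order (as the diagram suggests), and 0 is used as a "no covering tombstone" sentinel. The lookup needs one binary search over fragments, then a scan of that fragment's seqnum slice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified layout; field names are illustrative.
struct TombstoneStack {
  std::string start_key;  // inclusive
  std::string end_key;    // exclusive
  size_t seq_begin;       // first index into `seqs`
  size_t seq_end;         // one past the last index into `seqs`
};

struct FragmentedList {
  std::vector<TombstoneStack> stacks;  // sorted, non-overlapping fragments
  std::vector<uint64_t> seqs;          // descending within each fragment

  // Returns the largest seqnum <= `snapshot` among tombstones covering `key`,
  // or 0 if no visible tombstone covers it.
  uint64_t MaxCoveringSeq(const std::string& key, uint64_t snapshot) const {
    // Binary search for the first fragment whose end key is past `key`.
    auto it = std::upper_bound(
        stacks.begin(), stacks.end(), key,
        [](const std::string& k, const TombstoneStack& s) {
          return k < s.end_key;
        });
    if (it == stacks.end() || key < it->start_key) {
      return 0;
    }
    // Seqnums are descending, so the first one <= snapshot is the largest.
    for (size_t i = it->seq_begin; i < it->seq_end; ++i) {
      if (seqs[i] <= snapshot) {
        return seqs[i];
      }
    }
    return 0;
  }
};
```

Reading the diagram as index ranges, `[a, b)` would own `seqs[0..2)`, `[c, e)` would own `seqs[2..5)`, and `[h, k)` would own `seqs[5..7)`.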

Differential Revision: D13053103

Pulled By: abhimadan

fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
2018-11-14 17:52:17 -08:00
Siying Dong
f5c8cf5fed Increase wait time in DBTest.SanitizeNumThreads (#4659)
Summary:
DBTest.SanitizeNumThreads sometimes fails. The test waited for a 10ms timeout and expected all scheduled threads to be executing. This can be a source of flakiness. Instead, check every 1ms for up to 10s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4659
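
The "poll often, give up late" approach can be sketched as a small helper. This is a generic illustration rather than the code in the test; `WaitUntil` is a hypothetical name, and its defaults mirror the 1ms interval and 10s cap mentioned above.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical test helper: polls `done` every `interval` until it returns
// true or `timeout` elapses. Returns whether the condition was met.
inline bool WaitUntil(
    const std::function<bool()>& done,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1),
    std::chrono::milliseconds timeout = std::chrono::seconds(10)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return done();  // one last check at the deadline
    }
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

A fast machine returns after the first successful poll, while a slow or loaded one gets up to the full timeout before the test is declared failed.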

Differential Revision: D13074174

Pulled By: siying

fbshipit-source-id: b1d5ff87a326a4fc9eab8d1cc307bbb940dfe70c
2018-11-14 16:19:36 -08:00
Zhongyi Xie
d8df169b84 release db mutex when calling ApproximateSize (#4630)
Summary:
`GenSubcompactionBoundaries` calls `VersionSet::ApproximateSize` which gets BlockBasedTableReader for every file and seeks in its index block to find `key`'s offset. If the table or index block aren't in memory already, this involves I/O. This can be improved by releasing DB mutex when calling ApproximateSize.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4630

Differential Revision: D13052653

Pulled By: miasantreble

fbshipit-source-id: cae31d46d10d0860fa8a26b8d5154b2d17d1685f
2018-11-13 17:08:34 -08:00
Zhongyi Xie
b76398a82b apply ReadOptions.iterate_upper_bound to transaction iterator (#4656)
Summary:
Currently the transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforce the upper bound check when the iterator state changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656

Differential Revision: D13039257

Pulled By: miasantreble

fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1
2018-11-13 15:44:15 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy.

It is normal for external cache libraries to couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper of the library implementing both the `Cache` and `MemoryAllocator` interfaces.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
Siying Dong
abb1a8fc23 Add a unit test to assert number of preads (#4657)
Summary:
We used to have a bug that caused every block to be read twice, and none of our tests caught it. Add a very simple unit test to make sure that when reading a data block, we only issue one pread against the SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657

Differential Revision: D13005260

Pulled By: siying

fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c
2018-11-13 12:52:19 -08:00
QingpingWang
4f0fcb78ae Expose num entries and deletions of sst files (#4623)
Summary:
The ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level.
Also see #3980.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623
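
As an illustration of how such a ratio might be computed from per-file counts, here is a minimal sketch. `ToySstStats` and `DeletionRatio` are hypothetical names standing in for the exposed metadata fields, and any compaction threshold is the caller's choice.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-file stats mirroring the two exposed counts.
struct ToySstStats {
  uint64_t num_entries;
  uint64_t num_deletions;
};

// Returns total deletions divided by total entries across the given files,
// or 0.0 when there are no entries.
inline double DeletionRatio(const std::vector<ToySstStats>& files) {
  uint64_t entries = 0;
  uint64_t deletions = 0;
  for (const auto& f : files) {
    entries += f.num_entries;
    deletions += f.num_deletions;
  }
  if (entries == 0) {
    return 0.0;
  }
  return static_cast<double>(deletions) / static_cast<double>(entries);
}
```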

Differential Revision: D13045744

Pulled By: sagar0

fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3
2018-11-13 11:52:19 -08:00
Soli Como
5945e16dfc Divide NO_ITERATORS into two counters NO_ITERATOR_CREATED and NO_ITERATOR_DELETE (#4498)
Summary:
Currently, `Statistics` records ticks via `recordTick()`, whose second parameter is a `uint64_t`.
That means a tick can only increase.
If we want to decrease a tick, we have to use a workaround like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
That's kind of a hack.

So, this PR divides `NO_ITERATORS` into two counters, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, so that the counters only increase.

Fixes #3013 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498

Differential Revision: D10395010

Pulled By: sagar0

fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
2018-11-13 11:46:32 -08:00
Soli
a478682260 Fix #3840: only SyncClosedLogs for multiple CFs (#4460)
Summary:
Call `SyncClosedLogs()` only if there is more than one column family.

Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly.

See #3840 for more info.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460

Differential Revision: D12896377

Pulled By: riversand963

fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf
2018-11-13 11:32:16 -08:00
Andrew Kryczka
ea9454700a Backup engine support for direct I/O reads (#4640)
Summary:
Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640

Differential Revision: D13015268

Pulled By: ajkr

fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
2018-11-13 11:17:25 -08:00
Zhongyi Xie
b313019326 use per-level perfcontext for DB::Get calls (#4617)
Summary:
This PR adds two more per-level perf context counters to track:
* number of keys returned in Get calls, broken down by level
* total processing time at each level during Get calls
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617

Differential Revision: D12898024

Pulled By: miasantreble

fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
2018-11-13 10:40:49 -08:00
Sagar Vemuri
2993cd2002 Fix RocksDB Lite build (#4675)
Summary:
Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SstFileWriter` and `IngestExternalFile`, but these are not exposed under lite mode. Fixed by #ifdef'ing out the test.

```
db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’:
db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’
   rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options};
   ^
In file included from ./util/testharness.h:15:0,
                 from ./table/mock_table.h:23,
                 from ./db/db_test_util.h:44,
                 from db/db_test2.cc:13:
db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope
   ASSERT_OK(sst_file_writer.Open(external_file1));
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675

Differential Revision: D13035984

Pulled By: sagar0

fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6
2018-11-12 19:01:37 -08:00
Abhishek Madan
7d04ef4655 Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668)
Summary:
Part of the test required that a compaction start before a
manual flush, but this was not enforced by the test. In some cases,
particularly when writing to tmpfs, this could lead to the compaction
starting after the flush, which caused the base level to be higher than
it was expected to be. Add a sync point in the test to ensure that the
flush and compaction happen simultaneously.

The test also had some stale comments, so those have been removed or
modified, and the test has been simplified so that it no longer uses sleeps
and writes uncompressed SSTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4668

Differential Revision: D13032440

Pulled By: abhimadan

fbshipit-source-id: 3f23b583a096454dafb8d8ea75678605dec80209
2018-11-12 16:42:16 -08:00
DorianZheng
0f88160f67 Fix CompactFiles bug (#4665)
Summary:
`CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names`.

The timeline of the execution flow is as follows:

Let's say that level N has two files, [1,2] and [5,6]:
```
timeline              user_thread1                             user_thread2
t0   |      CompactFiles([1, 2], [5, 6]) begin
t1   |         GetReferencedSuperVersion()
t2   |                                              IngestExternalFile([3,4]) to level N begin
t3   |             CompactFiles resume
     V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665

Differential Revision: D13030674

Pulled By: ajkr

fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51
2018-11-12 14:32:18 -08:00
Yanqin Jin
05dec0c7c7 Remove redundant member var and set options (#4631)
Summary:
In the past, both `DBImpl::atomic_flush_` and
`DBImpl::immutable_db_options_.atomic_flush` existed. However, we failed to set
`immutable_db_options_.atomic_flush`, and instead used `DBImpl::atomic_flush_`, which was
set correctly. This did not lead to incorrect behavior, but it duplicated
information.

Since `immutable_db_options_` is always there and has `atomic_flush`, we should
use it as the source of truth and remove `DBImpl::atomic_flush_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631

Differential Revision: D12928371

Pulled By: riversand963

fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
2018-11-12 12:24:26 -08:00
DorianZheng
09426ae1c7 Fix DBImpl::GetColumnFamilyHandleUnlocked data race (#4666)
Summary:
Hi, yiwu-arbug, I found that `DBImpl::GetColumnFamilyHandleUnlocked` still has a data race, because `column_family_memtables_` has a stateful cache `current_`, and `column_family_memtables_::Seek` may be called by a write thread without the protection of `mutex_`.

check 859dbda6e3/db/write_batch.cc (L1188)  and   859dbda6e3/db/write_batch.cc (L1756)  and  859dbda6e3/db/db_impl_write.cc (L318)

So it's better to use `versions_->GetColumnFamilySet()->GetColumnFamily` instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4666

Differential Revision: D13027117

Pulled By: yiwu-arbug

fbshipit-source-id: 4e3778eaf8e7f7c8577bbd78129b6a5fd7ce79fb
2018-11-12 11:52:34 -08:00
Yi Wu
859dbda6e3 Fix DBTest.SoftLimit flakyness (#4658)
Summary:
The flakiness can be reproduced with the following patch:
```
 --- a/db/db_impl_compaction_flush.cc
+++ b/db/db_impl_compaction_flush.cc
@@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() {
       if (job_context.HaveSomethingToDelete()) {
         PurgeObsoleteFiles(job_context);
       }
+      static int f_count = 0;
+      printf("clean flush job context %d\n", ++f_count);
+      env_->SleepForMicroseconds(1000000);
       job_context.Clean();
       mutex_.Lock();
     }
```
The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` to be called. The event listener is triggered on `JobContext::Clean`, which happens after the flush result is installed. At the time we check for the stall condition after flushing the memtable, the job context cleanup may not be finished.

To fix the flakiness, we use a sync point to create a custom WaitForFlush that waits for the context cleanup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658

Differential Revision: D13007301

Pulled By: yiwu-arbug

fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712
2018-11-09 16:45:19 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Zhongyi Xie
fce5994603 Add more sync point to fix flaky test GroupCommitTest
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4637

Differential Revision: D12963727

Pulled By: miasantreble

fbshipit-source-id: 76053501afbecc6ef388ddc56542fa0185243e3f
2018-11-07 14:07:53 -08:00
Siying Dong
566fc8b994 Black list some valgrind tests (#4642)
Summary:
Valgrind tests with 1 thread run too long. To make them shorter, blacklist some long tests. These are already blacklisted in parallel valgrind tests, but not in non-parallel mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642

Differential Revision: D12945237

Pulled By: siying

fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d
2018-11-06 14:22:36 -08:00
Andrew Kryczka
fffac43cfb Add DB property for SST files kept from deletion (#4618)
Summary:
This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
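
A minimal sketch of the coarse-grained check described above, assuming that files numbered at or above the lower bound are the ones being protected. The function and parameter names are hypothetical and do not correspond to RocksDB internals.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate: an obsolete SST may be purged only if file
// deletions are enabled and its file number is below the lower bound that
// protects newly created files.
bool CanPurgeObsoleteFile(uint64_t file_number, bool deletions_enabled,
                          uint64_t min_protected_file_number) {
  if (!deletions_enabled) return false;
  return file_number < min_protected_file_number;
}
```

This shows why "rocksdb.is-file-deletions-enabled" returning true is not sufficient on its own: an obsolete file at or above the bound is still kept.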
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618

Differential Revision: D12898179

Pulled By: ajkr

fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
2018-11-05 20:24:40 -08:00
Siying Dong
c3105aa50d Try to fix ExternalSSTFileTest.IngestNonExistingFile flakines (#4625)
Summary:
ExternalSSTFileTest.IngestNonExistingFile occasionally fails because the number of SST files after manual compaction doesn't go down as expected. Although I haven't found how this can happen, adding an extra wait to make sure obsolete file purging has finished before we check the files doesn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4625

Differential Revision: D12910586

Pulled By: siying

fbshipit-source-id: 2a5ddec6908c99cf3bcc78431c6f93151c2cab59
2018-11-02 17:26:35 -07:00
Zhongyi Xie
61311157ff exclude get db property calls from rocksdb_lite (#4619)
Summary:
fix current failing lite test:
> In file included from ./util/testharness.h:15:0,
                 from ./table/mock_table.h:23,
                 from ./db/db_test_util.h:44,
                 from db/db_flush_test.cc:10:
db/db_flush_test.cc: In member function ‘virtual void rocksdb::DBFlushTest_ManualFlushFailsInReadOnlyMode_Test::TestBody()’:
db/db_flush_test.cc:250:35: error: ‘Properties’ is not a member of ‘rocksdb::DB’
   ASSERT_TRUE(db_->GetIntProperty(DB::Properties::kBackgroundErrors,
                                   ^
make: *** [db/db_flush_test.o] Error 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4619

Differential Revision: D12898319

Pulled By: miasantreble

fbshipit-source-id: 72de603b1f2e972fc8caa88611798c4e98e348c6
2018-11-02 11:28:59 -07:00
Yanqin Jin
de18a2d82e Update test to cover a new case in file ingestion (#4614)
Summary:
The new case is directIO = true, write_global_seqno = false in which we no longer write global_seqno to the external SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4614

Differential Revision: D12885001

Pulled By: riversand963

fbshipit-source-id: 7541bdc608b3a0c93d3c3c435da1b162b36673d4
2018-11-01 16:23:49 -07:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Andrew Kryczka
5c794d94c4 Prevent manual flush hanging in read-only mode (#4615)
Summary:
The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615

Differential Revision: D12888008

Pulled By: ajkr

fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c
2018-11-01 15:27:06 -07:00
Andrew Kryczka
b8f68bac38 Prevent manual compaction hanging in read-only mode (#4611)
Summary:
A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611

Differential Revision: D12871217

Pulled By: ajkr

fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e
2018-10-31 17:24:36 -07:00
Yanqin Jin
d1118f6f19 Add test to check if DB can handle atomic group (#4433)
Summary:
Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has a valid atomic group, an incomplete trailing atomic group, atomic groups mixed with normal version edits, and an atomic group with an incorrect size.
With this capability, RocksDB identifies invalid groups of version edits and does not apply them, thus guaranteeing that the DB is restored to a state consistent with the most recent successful atomic flush before applying the WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433

Differential Revision: D10079202

Pulled By: riversand963

fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876
2018-10-30 16:37:47 -07:00
Abhishek Madan
eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions is reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Siying Dong
9da88a8321 Remove info logging in db mutex inside EnableFileDeletions() (#4604)
Summary:
EnableFileDeletions() does info logging inside the DB mutex. This is not recommended in the code base, since there could be I/O involved. Move the logging outside the DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4604

Differential Revision: D12834432

Pulled By: siying

fbshipit-source-id: ffe5c2626fcfdb4c54a661a3c3b0bc95054816cf
2018-10-30 10:33:59 -07:00
Andrew Kryczka
cae540ebef Fix range tombstones written to more files than necessary (#4592)
Summary:
When there's a gap between files, we do not need to output tombstones starting at the next output file's begin key to the current output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4592

Differential Revision: D12808627

Pulled By: ajkr

fbshipit-source-id: 77c8b2e7523a95b1cd6611194144092c06acb505
2018-10-29 19:23:27 -07:00
Yanqin Jin
806ff34b61 Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596)
Summary:
Since ErrorHandler::RecoverFromNoSpace is a no-op in LITE mode, we should
not have this test in LITE mode. If we do keep it, it will cause the test
thread to wait on bg_cv_, which will never be signalled.

How to reproduce
```
$make clean && git checkout a27fce408e
$OPT="-DROCKSDB_LITE -g" make -j20
$./db_io_failure_test --gtest_filter=DBIOFailureTest.NoSpaceCompactRange
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4596

Differential Revision: D12818516

Pulled By: riversand963

fbshipit-source-id: bc83524f40fff1e29506979017f7f4c2b70322f3
2018-10-29 14:36:31 -07:00
Yanqin Jin
92b4401566 Avoid memtable cut when active memtable is empty (#4595)
Summary:
For flush triggered by RocksDB due to memory usage approaching a certain
threshold (WriteBufferManager or Memtable full), we should cut the memtable
only when the current active memtable is not empty, i.e. contains data. This is
what we do for non-atomic flush. If we always cut the memtable even when the active
memtable is empty, we will generate an extra, empty immutable memtable.
This is not ideal since it may cause a write stall. It also causes some
DBAtomicFlushTest tests to fail because cfd->imm()->NumNotFlushed() differs from
the expected value.

Test plan
```
$make clean && make J=1 -j32 all check
$make clean && OPT="-DROCKSDB_LITE -g" make J=1 -j32 all check
$make clean && TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make J=1 -j32 valgrind_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4595

Differential Revision: D12818520

Pulled By: riversand963

fbshipit-source-id: d867bdbeacf4199fdd642debb085f94703c41a18
2018-10-29 09:45:32 -07:00
Yanqin Jin
5b4c709fad Enable atomic flush (#4023)
Summary:
Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023

Differential Revision: D8518381

Pulled By: riversand963

fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f
2018-10-26 15:08:43 -07:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is meant to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Abhishek Madan
7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Zhongyi Xie
fe0d23059d Fix two contrun job failures (#4587)
Summary:
Currently there are two contrun test failures:
* rocksdb-contrun-lite:
> tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’:
tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’
     rocksdb::DumpMallocStats(&stats_string);
     ^
make: *** [tools/db_bench_tool.o] Error 1
* rocksdb-contrun-unity:
> In file included from unity.cc:44:0:
db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’:
db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous
   auto cmp = ParsedInternalKeyComparator(icmp_);

This PR fixes them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587

Differential Revision: D10846554

Pulled By: miasantreble

fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93
2018-10-24 20:16:45 -07:00
Yanqin Jin
eb8c9918f7 Remove unused variable
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4585

Differential Revision: D10841983

Pulled By: riversand963

fbshipit-source-id: 6a7e0b40065bcfbb10a2cac0cec1e8da0750a617
2018-10-24 15:51:45 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
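
The idea of carrying a single "highest covering sequence number" down the levels can be sketched as follows. This is a simplified illustration under stated assumptions: tombstones cover the half-open range [start, end), keys compare as plain strings, and snapshot visibility is ignored. `TombstoneSketch`, `MaxCoveringSeq`, and `IsDeleted` are hypothetical names, not RocksDB functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical range tombstone covering user keys in [start, end).
struct TombstoneSketch {
  std::string start;
  std::string end;
  uint64_t seq;
};

// Folds one level's tombstones into the running maximum carried down from
// higher levels. Only tombstones that cover `key` contribute.
uint64_t MaxCoveringSeq(const std::vector<TombstoneSketch>& level_tombstones,
                        const std::string& key, uint64_t max_from_higher) {
  uint64_t max_seq = max_from_higher;
  for (const TombstoneSketch& t : level_tombstones) {
    if (t.start <= key && key < t.end) {
      max_seq = std::max(max_seq, t.seq);
    }
  }
  return max_seq;
}

// A value written at `value_seq` is hidden if a covering tombstone is newer.
bool IsDeleted(uint64_t value_seq, uint64_t max_covering_seq) {
  return max_covering_seq > value_seq;
}
```

Because only one integer is carried between levels, the per-level cost no longer depends on how many tombstones were seen in earlier levels.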

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Zhongyi Xie
21bf7421ca use per-level perf context for bloom filter related counters (#4581)
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581

Differential Revision: D10518010

Pulled By: miasantreble

fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
2018-10-24 12:21:38 -07:00
Neil Mayhew
43dbd4411e Adapt three unit tests with newer compiler/libraries (#4562)
Summary:
This fixes three tests that fail with relatively recent tools and libraries.

The tests are:

* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`

I'm using:

* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1

The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.

However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:

```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```

I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562

Differential Revision: D10510917

Pulled By: maysamyabandeh

fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
2018-10-24 08:17:56 -07:00
Zhongyi Xie
f6b151f16d fix clang analyzer error (#4583)
Summary:
clang analyzer currently fails with the following warnings:
> db/log_reader.cc:323:9: warning: Undefined or garbage value returned to caller
        return r;
        ^~~~~~~~
db/log_reader.cc:344:11: warning: Undefined or garbage value returned to caller
          return r;
          ^~~~~~~~
db/log_reader.cc:369:11: warning: Undefined or garbage value returned to caller
          return r;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4583

Differential Revision: D10523517

Pulled By: miasantreble

fbshipit-source-id: 0cc8b8f27657b202bead148bbe7c4aa84fed095b
2018-10-23 22:14:54 -07:00
Maysam Yabandeh
c34cc40424 Fix user comparator receiving internal key (#4575)
Summary:
There was a bug where the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting a user key but receiving an internal key when called in GenerateBottommostFiles. The patch augments an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575

Differential Revision: D10500434

Pulled By: maysamyabandeh

fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
2018-10-23 08:14:46 -07:00
Siying Dong
7024263682 Dynamic level to adjust level multiplier when write is too heavy (#4338)
Summary:
Level compaction usually performs poorly when writes are so heavy that the level targets can't be guaranteed. With this change, we improve level_compaction_dynamic_level_bytes = true so that in write-heavy cases, the level multiplier can be slightly adjusted based on the size of L0.

We keep the behavior the same if the number of L0 files is under 2X the compaction trigger and the total size is less than options.max_bytes_for_level_base, so that the behavior doesn't change unless writes are so heavy that compaction cannot keep up.
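
The condition above can be written as a small predicate. This is a sketch only: the function name is hypothetical, and it assumes "total size" refers to the total bytes in L0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate mirroring the condition in the summary: the level
// multiplier is left unchanged while L0 is small both by file count and by
// total bytes (assumed here to mean the total size of L0).
bool KeepsDefaultMultiplier(int num_l0_files, int l0_compaction_trigger,
                            uint64_t l0_total_bytes,
                            uint64_t max_bytes_for_level_base) {
  return num_l0_files < 2 * l0_compaction_trigger &&
         l0_total_bytes < max_bytes_for_level_base;
}
```

Either limit being reached is treated as a sign that compaction is backlogged, at which point the multiplier may be adjusted.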
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338

Differential Revision: D9636782

Pulled By: siying

fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
2018-10-22 10:21:47 -07:00
Yi Wu
933250e355 Fix RepeatableThreadTest::MockEnvTest hang (#4560)
Summary:
When `MockTimeEnv` is used in a test to mock time methods, we cannot use `CondVar::TimedWait` because it uses real time, not the mocked time, for the wait timeout. On Mac the method can return immediately without waking other waiting threads if the real time is larger than `wait_until` (which is a mocked time). When that happens, the `wait()` method falls into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560

Differential Revision: D10472851

Pulled By: yiwu-arbug

fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
2018-10-21 20:17:18 -07:00
Yanqin Jin
da4aa59b4c Add read retry support to log reader (#4394)
Summary:
The current `log::Reader` does not retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.

The current implementation is simple and does not provide more advanced retry policies; these will be addressed in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394

Differential Revision: D9926508

Pulled By: riversand963

fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
2018-10-19 11:53:00 -07:00
Maysam Yabandeh
0afa5b53d7 Disable GroupCommitTest in Appveyor (#4536)
Summary:
We have already disabled it on Travis since it has been too flaky. The same problem arises in Appveyor as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4536

Differential Revision: D10452240

Pulled By: maysamyabandeh

fbshipit-source-id: 728f4ecddf780097159dc0a0737d460eb5ce4f09
2018-10-18 14:21:09 -07:00
Abhishek Madan
45f213b558 Lazily initialize RangeDelAggregator stripe map entries (#4497)
Summary:
When there are no range deletions, flush and compaction perform a binary search
on an effectively empty map every time they call ShouldDelete. This PR lazily
initializes each stripe map entry so that the binary search can be elided in
these cases.

After this PR, the total amount of time spent in compactions is 52.541331s, and the total amount of time spent in flush is 5.532608s, the former of which is a significant improvement from the results after #4495.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4497

Differential Revision: D10428610

Pulled By: abhimadan

fbshipit-source-id: 6f7e1ce3698fac3ef86d1197955e6b72e0931a0f
2018-10-17 11:47:34 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
The current implementation of perf context is level agnostic, making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filters.
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
anand1976
1e3845805d Properly determine a truncated CompactRange stop key (#4496)
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.

Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496

Differential Revision: D10395026

Pulled By: anand1976

fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
2018-10-15 23:22:51 -07:00
Yanqin Jin
e633983cf1 Add support to flush multiple CFs atomically (#4262)
Summary:
Leverage existing `FlushJob` to implement atomic flush of multiple column families.

This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262

Differential Revision: D9283109

Pulled By: riversand963

fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
2018-10-15 20:01:17 -07:00
Andrew Kryczka
32b4d4ad47 Avoid per-key linear scan over snapshots in compaction (#4495)
Summary:
`CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key.
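
The search described above can be sketched roughly as follows. This is a simplified illustration rather than the actual `findEarliestVisibleSnapshot`: it assumes the snapshots are held in a sorted `std::vector`, and `EarliestVisibleSnapshot` and `kNoSnapshot` are hypothetical names introduced only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Hypothetical sentinel meaning "no snapshot can see this key".
constexpr SequenceNumber kNoSnapshot = UINT64_MAX;

// Returns the smallest snapshot seqnum that is >= key_seq, i.e. the earliest
// snapshot in which a key written at key_seq is visible. The vector must be
// sorted in ascending order, which is what makes binary search valid.
SequenceNumber EarliestVisibleSnapshot(
    const std::vector<SequenceNumber>& snapshots, SequenceNumber key_seq) {
  assert(std::is_sorted(snapshots.begin(), snapshots.end()));
  // O(log n) per key instead of the O(n) linear scan.
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), key_seq);
  return it == snapshots.end() ? kNoSnapshot : *it;
}
```

Because this function runs once per key, the saving from the logarithmic search grows with the number of live snapshots.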
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495

Differential Revision: D10386470

Pulled By: ajkr

fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b
2018-10-15 16:21:22 -07:00
Yanqin Jin
729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires adding callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit from the class `EventListener`, as in `TestFileOperationListener`. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
Yi Wu
6f8d4bdff1 Fix compile error with jemalloc (#4488)
Summary:
The "je_" prefix of jemalloc APIs is present only when the macro `JEMALLOC_NO_RENAME` from jemalloc.h is present.

With the patch I'm also adding -DROCKSDB_JEMALLOC flag in buck TARGETS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4488

Differential Revision: D10355971

Pulled By: yiwu-arbug

fbshipit-source-id: 03a2d69790a44ac89219c7525763fa937a63d95a
2018-10-12 11:50:50 -07:00
Chinmay Kamat
6422356a27 Acquire lock on DB LOCK file before starting repair. (#4435)
Summary:
This commit adds code to acquire lock on the DB LOCK file
before starting the repair process. This will prevent
multiple processes from performing repair on the same DB
simultaneously. Fixes repair_test to work with this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4435

Differential Revision: D10361499

Pulled By: riversand963

fbshipit-source-id: 3c512c48b7193d383b2279ccecabdb660ac1cf22
2018-10-12 10:41:54 -07:00
Abhishek Madan
7dd1641048 Use vector in UncollapsedRangeDelMap (#4487)
Summary:
Using `./range_del_aggregator_bench --use_collapsed=false
--num_range_tombstones=5000 --num_runs=1000`, here are the results before and
after this change:

Before:
```
=========================
Results:
=========================
AddTombstones:           1822.61 us
ShouldDelete (first):    94.5286 us
```

After:
```
=========================
Results:
=========================
AddTombstones:           199.26 us
ShouldDelete (first):    38.9344 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4487

Differential Revision: D10347288

Pulled By: abhimadan

fbshipit-source-id: d44efe3a166d583acfdc3ec1199e0892f34dbfb7
2018-10-11 15:29:14 -07:00
UncP
531786ebf7 DBWriteImpl: remove redundant code (#4450)
Summary:
In `WriteThread::LaunchParallelMemTableWriters`, there is `write_group->running.store(write_group->size);`
https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450

Differential Revision: D10201900

Pulled By: yiwu-arbug

fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b
2018-10-10 21:00:32 -07:00
Simon Grätzer
ceded4535d WriteBatch::Iterate wrongly returns Status::Corruption (#4478)
Summary:
When I override `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error.
I don't think this check is used correctly here: The counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early.
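
The issue can be illustrated with a small stand-alone sketch. It does not use the real `WriteBatch` types: `Handler`, `StopAfter`, `IterStatus`, and `Iterate` below are hypothetical stand-ins, and the batch is modeled as a plain vector of integers. The point is only that the entry-count check is meaningful when the loop ran to completion, and should be skipped when the handler asked to stop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class IterStatus { kOk, kCorruption };

// Hypothetical stand-in for a batch handler.
struct Handler {
  virtual ~Handler() = default;
  virtual void Put(int entry) = 0;
  virtual bool Continue() { return true; }
};

// Example handler that asks the iteration to stop after `limit` entries.
struct StopAfter : Handler {
  explicit StopAfter(std::size_t limit) : limit_(limit) {}
  void Put(int) override { ++seen; }
  bool Continue() override { return seen < limit_; }
  std::size_t seen = 0;

 private:
  std::size_t limit_;
};

// Walks the entries and compares the number found against the count stored
// in the batch header, but only if the handler did not stop early.
IterStatus Iterate(const std::vector<int>& entries,
                   std::size_t expected_count, Handler* handler) {
  assert(handler != nullptr);
  std::size_t found = 0;
  bool stopped_early = false;
  for (int entry : entries) {
    if (!handler->Continue()) {
      stopped_early = true;
      break;
    }
    handler->Put(entry);
    ++found;
  }
  if (!stopped_early && found != expected_count) {
    return IterStatus::kCorruption;  // a genuine count mismatch
  }
  return IterStatus::kOk;
}
```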
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478

Differential Revision: D10317416

Pulled By: yiwu-arbug

fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61
2018-10-10 20:57:27 -07:00
Andrew Kryczka
7e56072290 Fix merge operand reappearing when covered by DeleteRange (#4481)
Summary:
Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481

Differential Revision: D10319507

Pulled By: ajkr

fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
2018-10-10 18:16:12 -07:00
Peter Pei
09814f2cfc support OnCompactionBegin (#4431)
Summary:
fix #4288

Add `OnCompactionBegin` support to `rocksdb::EventListener`.

Currently, we only have these three callbacks:

- OnFlushBegin
- OnFlushCompleted
- OnCompactionCompleted

As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`.

This PR is an attempt to implement support for `OnCompactionBegin`.

Hope it is useful to you.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431

Differential Revision: D10055515

Pulled By: yiwu-arbug

fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
2018-10-10 17:32:27 -07:00
Andrew Kryczka
faa70fc575 DeleteRange regression tests using public API (#4476)
Summary:
I wrote a couple of tests using the public API to expose/prevent the bugs we talked about. In particular,

- When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key.
- When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476

Differential Revision: D10286001

Pulled By: ajkr

fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa
2018-10-10 12:30:11 -07:00
Abhishek Madan
9c6fea7fe1 Update HISTORY.md, fix unity_test failure (#4479)
Summary:
Follow-up to https://github.com/facebook/rocksdb/pull/4432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479

Differential Revision: D10304151

Pulled By: abhimadan

fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7
2018-10-10 12:09:56 -07:00
Anand Ananthabhotla
854a4be03f Handle mixed slowdown/no_slowdown writer properly (#4475)
Summary:
There is a bug when the write queue leader is blocked on a write
delay/stop, and the queue has writers with WriteOptions::no_slowdown set
to true. They are not woken up until the write stall is cleared.

The fix introduces a dummy writer inserted at the tail to indicate a
write stall and prevent further inserts into the queue, and a condition
variable that writers who can tolerate slowdown wait on before adding
themselves to the queue. The leader calls WriteThread::BeginWriteStall()
to add the dummy writer and then walk the queue to fail any writers with
no_slowdown set. Once the stall clears, the leader calls
WriteThread::EndWriteStall() to remove the dummy writer and signal the
condition variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475

Differential Revision: D10285827

Pulled By: anand1976

fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
2018-10-09 22:52:40 -07:00
jsteemann
141ef7f8d3 avoid copying when iterating using range-based for (#4459)
Summary:
this avoids a few copies of std::string and other structs
in the context of range-based for loops. instead of copying
the values for each iteration, use a const reference to avoid
copying.
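
As a small illustration of the difference, the sketch below counts copies made by each loop form. `Tracked`, `SumByValue`, and `SumByRef` are hypothetical names used only here and are not taken from the patch.

```cpp
#include <cassert>
#include <vector>

// Small type that counts how often it is copy-constructed.
struct Tracked {
  static int copies;
  int value = 0;
  Tracked() = default;
  explicit Tracked(int v) : value(v) {}
  Tracked(const Tracked& other) : value(other.value) { ++copies; }
  Tracked& operator=(const Tracked&) = default;
};
int Tracked::copies = 0;

// `auto item` copy-constructs one element per iteration.
int SumByValue(const std::vector<Tracked>& items) {
  int sum = 0;
  for (auto item : items) {
    sum += item.value;
  }
  return sum;
}

// `const auto& item` binds to the element in place, so nothing is copied.
int SumByRef(const std::vector<Tracked>& items) {
  int sum = 0;
  for (const auto& item : items) {
    sum += item.value;
  }
  return sum;
}
```

For cheap types such as `int` the two forms are equivalent in practice; the reference form matters for `std::string` and larger structs.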
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459

Differential Revision: D10282045

Pulled By: sagar0

fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb
2018-10-09 17:15:51 -07:00
jsteemann
517d3b8b77 fix typo in error message, twice (#4457)
Summary:
Fixes a typo in error messages returned by Iterator::GetProperty(...)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457

Differential Revision: D10281965

Pulled By: sagar0

fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22
2018-10-09 17:07:27 -07:00
Abhishek Madan
3a4bd36fed Truncate range tombstones by leveraging InternalKeys (#4432)
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlapping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundaries
instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683
for motivating examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432

Differential Revision: D10263952

Pulled By: abhimadan

fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
2018-10-09 15:19:38 -07:00
Zhongyi Xie
283a700f5d add locking around calls to RecalculateWriteStallConditions in column_family_test (#4474)
Summary:
this should fix the current failing TSAN jobs:
The callstack for TSAN:
> WARNING: ThreadSanitizer: data race (pid=87440)
  Read of size 8 at 0x7d580000fce0 by thread T22 (mutexes: write M548703):
    #0 rocksdb::InternalStats::DumpCFStatsNoFileHistogram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:1204 (column_family_test+0x00000080eca7)
    #1 rocksdb::InternalStats::DumpCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:1169 (column_family_test+0x0000008106d0)
    #2 rocksdb::InternalStats::HandleCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Slice) db/internal_stats.cc:578 (column_family_test+0x000000810720)
    #3 rocksdb::InternalStats::GetStringProperty(rocksdb::DBPropertyInfo const&, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) db/internal_stats.cc:488 (column_family_test+0x00000080670c)
    #4 rocksdb::DBImpl::DumpStats() db/db_impl.cc:625 (column_family_test+0x00000070ce9a)

>  Previous write of size 8 at 0x7d580000fce0 by main thread:
    #0 rocksdb::InternalStats::AddCFStats(rocksdb::InternalStats::InternalCFStatsType, unsigned long) db/internal_stats.h:324 (column_family_test+0x000000693bbf)
    #1 rocksdb::ColumnFamilyData::RecalculateWriteStallConditions(rocksdb::MutableCFOptions const&) db/column_family.cc:818 (column_family_test+0x000000693bbf)
    #2 rocksdb::ColumnFamilyTest_WriteStallSingleColumnFamily_Test::TestBody() db/column_family_test.cc:2563 (column_family_test+0x0000005e5a49)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4474

Differential Revision: D10262099

Pulled By: miasantreble

fbshipit-source-id: 1247973a3ca32e399b4575d3401dd5439c39efc5
2018-10-09 14:10:13 -07:00
Zhongyi Xie
cac87fcf57 move dump stats to a separate thread (#4382)
Summary:
Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all.
We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us to schedule new timed tasks with more deterministic behavior.

Tested with db_bench using `--stats_dump_period_sec=20` in command line:
> LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------

LOG content:
> 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
** DB Stats **
Uptime(secs): 20.0 total, 20.0 interval
Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
Interval stall: 00:00:0.012 H:M:S, 0.1 percent
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382

Differential Revision: D9933051

Pulled By: miasantreble

fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
2018-10-08 22:54:43 -07:00
DorianZheng
27090ae8f6 Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391)
Summary:
- Fix DBImpl API race condition

The timeline of the execution flow is as follows:
```
timeline              user_thread1                      user_thread2
t1   |     cfh = GetColumnFamilyHandleUnlocked(0)
t2   |     id1 = cfh->GetID()
t3   |                                                GetColumnFamilyHandleUnlocked(1)
t4   |     id2 = cfh->GetID()
     V
```
The original implementation returns a pointer to a stateful variable, so the returned `ColumnFamilyHandle` changes when another thread calls `GetColumnFamilyHandleUnlocked` with a different `column family id`.

- Expose ColumnFamily ID to compaction event listener

- Fix the return status of `DBImpl::GetLatestSequenceForKey`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391

Differential Revision: D10221243

Pulled By: yiwu-arbug

fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993
2018-10-08 14:24:16 -07:00
DorianZheng
e0f05754ba Expose column family id to OnCompactionCompleted (#4466)
Summary:
The controller you requested could not be found. PTAL
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466

Differential Revision: D10241358

Pulled By: yiwu-arbug

fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e
2018-10-08 14:24:16 -07:00
DorianZheng
7487a7628c Fix return status of DBImpl::GetLatestSequenceForKey
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4467

Differential Revision: D10241418

Pulled By: yiwu-arbug

fbshipit-source-id: f6adbe7292b2c934e14971c7432b3eb115c35026
2018-10-08 14:22:05 -07:00
Maysam Yabandeh
21b51dfec4 Add inline comments to flush job (#4464)
Summary:
It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464

Differential Revision: D10224918

Pulled By: maysamyabandeh

fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe
2018-10-05 15:41:17 -07:00
Maysam Yabandeh
1fb6805527 Fix snprintf buffer overflow bug (#4465)
Summary:
The contract of snprintf says that it returns "The number of characters that would have been written if n had been sufficiently large" http://www.cplusplus.com/reference/cstdio/snprintf/
The existing code however was assuming that the return value is the actual number of written bytes and uses that to reposition the starting point on the next call to snprintf. This leads to buffer overflow when the last call to snprintf has filled up the buffer.
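
A minimal sketch of the safe pattern, assuming a fixed-size buffer that is filled by repeated `snprintf` calls. `AppendInt` is a hypothetical helper written for this example and is not the code touched by the patch; the key step is comparing the return value against the remaining space before advancing the offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Appends "<value>," to buf at *offset without ever writing past capacity.
// Returns false if the buffer is full or the output had to be truncated.
bool AppendInt(char* buf, std::size_t capacity, std::size_t* offset,
               int value) {
  assert(*offset <= capacity);
  const std::size_t remaining = capacity - *offset;
  if (remaining == 0) {
    return false;
  }
  const int n = std::snprintf(buf + *offset, remaining, "%d,", value);
  if (n < 0) {
    return false;  // encoding error
  }
  if (static_cast<std::size_t>(n) >= remaining) {
    // Truncated: snprintf wrote remaining - 1 characters plus the
    // terminating '\0', but returned the length it would have needed.
    // Adding n to the offset here is exactly the overflow bug.
    *offset = capacity - 1;
    return false;
  }
  *offset += static_cast<std::size_t>(n);
  return true;
}
```

After a truncated call the offset points at the terminating null byte, so any later call sees one byte of space and reports truncation again instead of writing out of bounds.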
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4465

Differential Revision: D10224080

Pulled By: maysamyabandeh

fbshipit-source-id: 40f44e122d15b0db439812a0a361167cf012de3e
2018-10-05 14:50:51 -07:00
Dmitry Alimov
e13d8dcbbb Fix typos in comments (#4456)
Summary:
Fix some typos in the comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456

Differential Revision: D10209214

Pulled By: miasantreble

fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb
2018-10-04 20:46:50 -07:00
Zhongyi Xie
ce1fc5af09 fix unused param allocator in compression.h (#4453)
Summary:
This should fix the currently failing contrun tests: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
JiYou
a1f6142f38 VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385)
Summary:
This fix is for `level == 0` in `GetOverlappingInputs()`:
- In `GetOverlappingInputs()`, if `level == 0`, it has potential
risk of overflow if `i == 0`.
- Optimize the process when `expand = true`, the expected complexity
can be reduced to O(n).

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385

Differential Revision: D10181001

Pulled By: riversand963

fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5
2018-10-03 18:40:59 -07:00
Yanqin Jin
4e58b2ea3d Check for compression lib support before test exec (#4443)
Summary:
Before running CompactFilesTest.SentinelCompressionType, we should check
whether zlib and snappy are supported.

CompactFilesTest.SentinelCompressionType is a newly added test. Compilation and
linking with different options, e.g. COMPILE_WITH_TSAN, COMPILE_WITH_ASAN, etc.
lead to generation of different binaries. On the one hand, it's not clear why
zlib or snappy is present under ASAN, but not under TSAN. On the other hand,
changing the compilation flags for TSAN or ASAN seems a bigger change worth much
more attention. To unblock the cont-runs, I suggest that we simply add these
two checks at the beginning of the test, as we did for
GeneralTableTest.ApproximateOffsetOfCompressed in table/table_test.cc.

Future actions include investigating the absence of zlib and snappy when
compiling with TSAN, i.e. COMPILE_WITH_TSAN=1, if necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4443

Differential Revision: D10140935

Pulled By: riversand963

fbshipit-source-id: 62f96d1e685386accd2ef0b98f6f754d3fd67b3e
2018-10-02 10:42:01 -07:00
Yanqin Jin
be5cc4c7b8 Remove a race condition between lsdir and rm (#4440)
Summary:
In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is
a race condition between `DBTestBase::GetSstFileCount` and
`DBImpl::PurgeObsoleteFiles`. The following graph explains why.

```
Timeline  db_compact_test_t              bg_flush_t         bg_compact_t
    |  [initiate bg flush and
    |      start waiting]
    |                                     flush
    |                                     DeleteObsoleteFiles
    |  [waken up by bg_flush_t which
    |   signaled in DeleteObsoleteFiles]
    |
    |  [initiate compaction and
    |   start waiting]
    |
    |                                                         [compact,
    |                                                          set manual.done to true]
    |                                   [signal at the end of
    |                                    BackgroundCallFlush]
    |
    |  [waken up by bg_flush_t
    |   which signaled before
    |   returning from
    |   BackgroundCallFlush]
    |
    |  Check manual.done is true
    |
    |  GetSstFileCount    <-- race condition -->           PurgeObsoleteFiles
    V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440

Differential Revision: D10122628

Pulled By: riversand963

fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f
2018-10-01 11:57:55 -07:00
Andrew Kryczka
ac6f435a9a Fix CompactFiles support for kDisableCompressionOption (#4438)
Summary:
Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused the program to crash on an assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438

Differential Revision: D10115761

Pulled By: ajkr

fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122
2018-10-01 01:18:10 -07:00
JiYou
75ca13875c FindFile: use std::lower_bound reduce the repeated code. (#4372)
Summary:
`FindFile()` and `FindFileInRange()` actually work the same way as
`std::lower_bound()`. Use `std::lower_bound()` to reduce the
repeated code.

- change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()`
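
The idea can be sketched as follows, under simplifying assumptions: keys are plain `std::string`s compared bytewise rather than internal keys with an `InternalKeyComparator`, and `FileRange` is a hypothetical stand-in for the file metadata. It is not the code from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for per-file metadata in a sorted level.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns the index of the first file whose largest key is >= key, or
// files.size() if there is none. The files must be sorted by largest key.
std::size_t FindFile(const std::vector<FileRange>& files,
                     const std::string& key) {
  assert(std::is_sorted(files.begin(), files.end(),
                        [](const FileRange& a, const FileRange& b) {
                          return a.largest < b.largest;
                        }));
  // lower_bound replaces the hand-written binary search loop.
  auto it = std::lower_bound(
      files.begin(), files.end(), key,
      [](const FileRange& f, const std::string& k) { return f.largest < k; });
  return static_cast<std::size_t>(it - files.begin());
}
```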

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372

Differential Revision: D9919677

Pulled By: ajkr

fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de
2018-09-27 10:35:00 -07:00
Yi Wu
dc813e4b85 Improve log handling when recover without flush (#4405)
Summary:
Improve log handling when avoid_flush_during_recovery=true.
1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253.
2. truncate the last existing log, since this log can contain preallocated space and it would be a waste to keep the space. This prevents a crash loop in the user application from creating many logs of non-trivial size that ultimately take up all disk space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405

Differential Revision: D9953933

Pulled By: yiwu-arbug

fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56
2018-09-26 10:37:48 -07:00
Nikhil Benesch
17edc82a4b Handle tombstones at the same seqno in the CollapsedRangeDelMap (#4424)
Summary:
The CollapsedRangeDelMap was entirely mishandling tombstones at the same
sequence number when the tombstones did not have identical start and end
keys. Such tombstones are common since 90fc40690, which causes
tombstones to be split during compactions.

For example, if the tombstone [a, c) @ 1 lies across a compaction
boundary at b, it will be split into [a, b) @ 1 and [b, c) @ 1. Without
this patch, the collapsed range deletion map would look like this:

  a -> 1
  b -> 1
  c -> 0

Notice how the b -> 1 entry is redundant. When the tombstones overlap,
the problem is even worse. Consider tombstones [a, c) @ 1 and [b, d) @
1, which produces this map without this patch:

  a -> 1
  b -> 1
  c -> 0
  d -> 0

This map is corrupt, as a map can never contain adjacent sentinel (zero)
entries. When the iterator advances from b to c, it will notice that c
is a sentinel entry and skip to d--but d is also a sentinel entry! Asking
what tombstone this iterator points to will trigger an assertion, as it
is not pointing to a valid tombstone.
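
To make the expected result concrete, here is a brute-force sketch that builds a collapsed map from a list of tombstones. It is not the incremental algorithm used by `CollapsedRangeDelMap`; `Tombstone` and `Collapse` are hypothetical names, keys are plain strings compared bytewise, and seqnum 0 is reserved as the sentinel, as in the examples above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Covers the key range [start, end) at sequence number seq.
struct Tombstone {
  std::string start;
  std::string end;
  uint64_t seq;
};

// Builds a map from key to the seqnum in effect from that key up to the next
// entry. A value of 0 is the sentinel meaning "no tombstone covers this range".
std::map<std::string, uint64_t> Collapse(
    const std::vector<Tombstone>& tombstones) {
  std::set<std::string> endpoints;
  for (const auto& t : tombstones) {
    if (t.start < t.end) {  // ignore empty ranges
      assert(t.seq > 0);    // 0 is reserved for the sentinel
      endpoints.insert(t.start);
      endpoints.insert(t.end);
    }
  }
  std::map<std::string, uint64_t> collapsed;
  uint64_t prev = 0;  // nothing is covered before the first endpoint
  for (const auto& key : endpoints) {
    // Coverage is constant between consecutive endpoints, so evaluating it
    // at the endpoint itself is enough.
    uint64_t seq = 0;
    for (const auto& t : tombstones) {
      if (t.start <= key && key < t.end) {
        seq = std::max(seq, t.seq);
      }
    }
    // Emit an entry only when the seqnum changes. This drops the redundant
    // b -> 1 entry and never produces two adjacent sentinel entries.
    if (seq != prev) {
      collapsed[key] = seq;
      prev = seq;
    }
  }
  return collapsed;
}
```

For the overlapping example, [a, c) @ 1 and [b, d) @ 1, this yields `a -> 1, d -> 0`, and for the split example, [a, b) @ 1 and [b, c) @ 1, it yields `a -> 1, c -> 0`.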

/cc ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4424

Differential Revision: D10039248

Pulled By: abhimadan

fbshipit-source-id: 6d737c1e88d60e80cf27286726627ba44463e7f4
2018-09-25 14:50:31 -07:00
Abhishek Madan
3c350a7cf0 Improve RangeDelAggregator benchmarks (#4395)
Summary:
Improve time measurements for AddTombstones to only include the
call and not the VectorIterator setup. Also add a new
add_tombstones_per_run flag to call AddTombstones multiple times per
aggregator, which will help simulate more realistic workloads.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4395

Differential Revision: D9996811

Pulled By: abhimadan

fbshipit-source-id: 5865a95c323fbd9b3606493013664b4890fe5a02
2018-09-21 16:13:08 -07:00
Anand Ananthabhotla
72712f4e28 Allow dynamic modification of window size and deletion trigger (#4403)
Summary:
Make the CompactOnDeletionCollectorFactory class public, and provide
methods to update the window size and deletion trigger params. These
will take effect on subsequently created SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403

Differential Revision: D9976857

Pulled By: anand1976

fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48
2018-09-20 15:15:28 -07:00
Andrew Kryczka
990b52e95b Unit test for custom comparator RangeDelAggregator (#4388)
Summary:
Add a unit test for range collapsing when non-default comparator is used. This exposes the bug fixed in #4386.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4388

Differential Revision: D9918252

Pulled By: ajkr

fbshipit-source-id: 99501b96b251eab41791a7e33b27055ee36c5c39
2018-09-18 12:13:20 -07:00
jsteemann
27221b0cc2 use specified comparator in CollapsedRangeDelMap (#4386)
Summary:
The Comparator passed to CollapsedRangeDelMap was not used for
operator less of the std::map `rep_` object contained in
CollapsedRangeDelMap. So the map was always sorted using the
default ByteWiseComparator, which seems wrong.

Passing the specified Comparator through for usage in that map
object fixes actual problems we were seeing with RangeDelete operations
that do not delete keys as expected when using a custom Comparator.
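
A minimal sketch of the fix's idea, assuming a runtime comparator object similar in spirit to `rocksdb::Comparator`. `KeyComparator`, `ReverseComparator`, `LessByComparator`, and `MakeRangeMap` are hypothetical names for this example only. Without the third template argument and the constructor argument, `std::map` silently falls back to `std::less<std::string>`, which is bytewise order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical runtime comparator interface.
struct KeyComparator {
  virtual ~KeyComparator() = default;
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

// Example user comparator: reverse bytewise order.
struct ReverseComparator : KeyComparator {
  int Compare(const std::string& a, const std::string& b) const override {
    return b.compare(a);
  }
};

// Adapter that lets std::map order its keys with the user's comparator.
struct LessByComparator {
  const KeyComparator* cmp;
  bool operator()(const std::string& a, const std::string& b) const {
    return cmp->Compare(a, b) < 0;
  }
};

using RangeMap = std::map<std::string, uint64_t, LessByComparator>;

// The comparator has to be handed to the map explicitly at construction.
RangeMap MakeRangeMap(const KeyComparator* cmp) {
  assert(cmp != nullptr);
  return RangeMap(LessByComparator{cmp});
}
```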

I found that the tests in current master crash when I run them locally,
both with and without my patch, at the very same location. I therefore
don't know if the patch breaks something else, but it seems to fix
RangeDeletion issues in our product that uses RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386

Differential Revision: D9916506

Pulled By: ajkr

fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66
2018-09-18 09:28:30 -07:00
Maysam Yabandeh
65ac72edd9 Fix bug in partition filters with format_version=4 (#4381)
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381

Differential Revision: D9879505

Pulled By: maysamyabandeh

fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
2018-09-17 17:28:15 -07:00
Abhishek Madan
1626f6ab6b Add RangeDelAggregator microbenchmarks (#4363)
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.

Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:

Before #4014:
```
=======================
Results:
=======================
AddTombstones:          1356.28 us
ShouldDelete:           0.401732 us
```

Latest master:
```
=======================
Results:
=======================
AddTombstones:          740.82 us
ShouldDelete:           0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363

Differential Revision: D9881676

Pulled By: abhimadan

fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
2018-09-17 14:58:31 -07:00
Anand Ananthabhotla
30c21df97c Fix regression test failures introduced by PR #4164 (#4375)
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375

Differential Revision: D9875779

Pulled By: anand1976

fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
2018-09-17 13:14:07 -07:00
Anand Ananthabhotla
a27fce408e Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accommodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e., the database is put in read-only mode and goes back to read-write
only after certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
  a) On the first occurrence of an out of space error during compaction,
subsequent
  compactions will be delayed until the disk free space check indicates
  enough available space. The required space is computed as the sum of
  input sizes.
  b) The free space check requirement will be removed once the amount of
  free space is greater than the size reserved by in progress
  compactions when the first error occurred
  c) If the out of space error is a hard error, a background thread in
  SFM will poll for sufficient headroom before triggering the recovery
  of the database and putting it in read-write mode. The headroom is
  calculated as the sum of the write_buffer_size of all the DB instances
  associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recovery in the start
callback, and later initiate it manually by calling DB::Resume()
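
The two space checks described above can be sketched as follows. This is a minimal illustration, not the SstFileManagerImpl code: the function names are hypothetical, and treating "enough" as a simple greater-or-equal comparison is an assumption.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Estimated space a compaction needs: the sum of its input file sizes.
uint64_t RequiredCompactionSpace(const std::vector<uint64_t>& input_sizes) {
  return std::accumulate(input_sizes.begin(), input_sizes.end(), uint64_t{0});
}

// Soft error path: delay the compaction until free space covers the estimate.
// (Hypothetical helper; the comparison used here is an assumption.)
bool CanStartCompaction(uint64_t free_bytes,
                        const std::vector<uint64_t>& input_sizes) {
  return free_bytes >= RequiredCompactionSpace(input_sizes);
}

// Hard error path: headroom is the sum of write_buffer_size over all DB
// instances that share the same SFM.
uint64_t HardErrorHeadroom(const std::vector<uint64_t>& write_buffer_sizes) {
  return std::accumulate(write_buffer_sizes.begin(), write_buffer_sizes.end(),
                         uint64_t{0});
}

// Recovery is triggered only once free space reaches the headroom.
bool CanRecoverFromHardError(uint64_t free_bytes,
                             const std::vector<uint64_t>& write_buffer_sizes) {
  return free_bytes >= HardErrorHeadroom(write_buffer_sizes);
}
```

A background poller would call the relevant check periodically and resume work the first time it returns true.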

Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164

Differential Revision: D9846378

Pulled By: anand1976

fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 13:43:04 -07:00
Sagar Vemuri
3db584059c Remove sync point from Block destructor (#4370)
Summary:
```
AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const
==1798517==ABORTING
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370

Differential Revision: D9844146

Pulled By: sagar0

fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1
2018-09-15 00:12:57 -07:00
Dmitri Smirnov
879998b369 Adjust c test and fix windows compilation issues
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4369

Differential Revision: D9844200

Pulled By: sagar0

fbshipit-source-id: 0d9f5f73b28234eaac55d3551ce4e2dc177af138
2018-09-14 20:57:22 -07:00
JiYou
82e8e9e26b VersionBuilder: optimize SaveTo() to linear time. (#4366)
Summary:
Because `base_files` and `added_files` are both sorted, applying a merge
operation to these two sorted arrays is more efficient. The complexity
is reduced to linear time.

    - optimize the merge complexity.
    - move the `NDEBUG` of sorted `added_files` out of merge process.
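
A minimal sketch of the idea, assuming both inputs are already sorted by the same key. `FileEntry` and `MergeSortedFiles` are hypothetical names for illustration, not the VersionBuilder types.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a file entry; only the ordering key matters here.
struct FileEntry {
  int smallest_key;
  bool from_base;  // true if the entry came from the base list
};

// Merge two lists that are each already sorted by smallest_key.
// std::merge walks both inputs once, so the cost is linear in the total size.
std::vector<FileEntry> MergeSortedFiles(const std::vector<FileEntry>& base,
                                        const std::vector<FileEntry>& added) {
  std::vector<FileEntry> out;
  out.reserve(base.size() + added.size());
  std::merge(base.begin(), base.end(), added.begin(), added.end(),
             std::back_inserter(out),
             [](const FileEntry& a, const FileEntry& b) {
               return a.smallest_key < b.smallest_key;
             });
  return out;
}
```

Checking that the added list is sorted can then be done once, before the merge, rather than inside the loop.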

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366

Differential Revision: D9833592

Pulled By: ajkr

fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0
2018-09-14 19:43:04 -07:00
Yanqin Jin
8959063c9c Store the return value of Fsync for check
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4361

Differential Revision: D9803723

Pulled By: riversand963

fbshipit-source-id: 5a0d4cd3e57fd195571dcd5822895ee00547fa6a
2018-09-14 13:29:56 -07:00
Yanqin Jin
82057b0d8f Improve type conversion (#4367)
Summary:
Use `static_cast<type>(var)` instead of `(type)var`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4367

Differential Revision: D9833391

Pulled By: riversand963

fbshipit-source-id: 3d33fc2c290e7e0f3d1d45b256a881d1bc5a7df2
2018-09-14 11:12:52 -07:00
Andrew Kryczka
c94523ee56 Delete code for WAL reader to start at nonzero offset (#4362)
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362

Differential Revision: D9817829

Pulled By: ajkr

fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
2018-09-13 17:13:03 -07:00
Vitaly Isaev
0bd2ede10e Memory usage stats in C API (#4340)
Summary:
Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in the plain C API. I'm working on a Go application and trying to investigate the causes of high memory consumption (#4313). The Go [wrappers](https://github.com/tecbot/gorocksdb) are built on top of the RocksDB C API. According to #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered the best option for getting database internal memory usage stats, but it wasn't supported in the C API yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340

Differential Revision: D9655135

Pulled By: ajkr

fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a
2018-09-13 14:27:31 -07:00
Dan Melnic
ca92fc71a4 Initialize uninitialized std::atomic variables
Summary: Initialize uninitialized std::atomic variables

Reviewed By: yfeldblum

Differential Revision: D9758050

fbshipit-source-id: 865d89eddafc81f3cab6f11e2ebb669f7ff70d04
2018-09-12 08:58:05 -07:00
Abhishek Madan
c86a22ac43 Restrict RangeDelAggregator's tombstone end-key truncation (#4356)
Summary:
`RangeDelAggregator::AddTombstones` contained an assertion which stated that, if a range tombstone extended past the largest key in the sstable, then `FileMetaData::largest` must have a sentinel sequence number of `kMaxSequenceNumber`, which implies that the tombstone's end key is safe to truncate. However, `largest` will not be a sentinel key when the next sstable in the level's smallest key is equal to the current sstable's largest key, which caused the assertion to fail.

The assertion must hold for the truncation to be safe, so it has been moved to an additional check on end-key truncation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4356

Differential Revision: D9760891

Pulled By: abhimadan

fbshipit-source-id: 7c20c3885cd919dcd14f291f88fd27aa33defebc
2018-09-10 17:42:43 -07:00
Maysam Yabandeh
3f5282268f Skip concurrency control during recovery of pessimistic txn (#4346)
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This can serve as an optimization if the application knows that the transaction will not conflict with concurrent transactions. It is currently used during recovery, assuming (i) the application guarantees no conflict between prepared transactions in the WAL and (ii) the application guarantees that recovered transactions will be rolled back or committed before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346

Differential Revision: D9759149

Pulled By: maysamyabandeh

fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
2018-09-10 16:57:53 -07:00
Anand Ananthabhotla
ced618cf39 Fix a lint error due to unspecified move evaluation order (#4348)
Summary:
In C++ 11, the order of argument and move evaluation in a statement such
as below is unspecified -
  foo(a.b).bar(std::move(a))
The compiler is free to evaluate std::move(a) first, and then a.b is unspecified.

In C++ 17, this will be safe if a draft proposal around function
chaining rules is accepted.
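
A minimal sketch of the safe pattern, using hypothetical `Request` and `Builder` types: read what is needed from the object into a local first, then move the object in a separate step so the result does not depend on evaluation order.

```cpp
#include <string>
#include <utility>

// Hypothetical types for illustration only.
struct Request {
  std::string label;
  std::string payload;
};

struct Builder {
  std::string label;
  std::string payload;
  // Takes the request by value, so the caller's std::move triggers a real move.
  Builder& Attach(Request r) {
    payload = std::move(r.payload);
    return *this;
  }
};

Builder MakeBuilder(const std::string& label) { return Builder{label, ""}; }

// Risky form (not used here): MakeBuilder(req.label).Attach(std::move(req));
// The move into Attach's parameter may happen before req.label is read.
Builder BuildSafely(Request req) {
  const std::string label = req.label;  // copy out before the move
  return MakeBuilder(label).Attach(std::move(req));
}
```

The local copy costs one extra string copy but makes the read and the move explicitly ordered.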
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4348

Differential Revision: D9688810

Pulled By: anand1976

fbshipit-source-id: e4651d0ca03dcf007e50371a0fc72c0d1e710fb4
2018-09-06 14:42:57 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support the "pragma once" directive as an alternative to include guards. To keep header files consistent, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Andrew Kryczka
1a88c43751 Reduce empty SST creation/deletion in compaction (#4336)
Summary:
This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug.
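
The kind of underflow described can be sketched as below. This is a hypothetical illustration: `SentinelMap` is not the RocksDB class, and the assumption that the size is computed as "entries minus one sentinel" is mine, not stated in this commit message.

```cpp
#include <cstddef>
#include <map>

// Hypothetical map that keeps one trailing sentinel entry per range when
// non-empty, so the logical size is one less than the stored entry count.
class SentinelMap {
 public:
  void AddRange(int start, int end) {
    rep_[start] = 1;  // start of the range
    rep_[end] = 0;    // sentinel marking the end of the range
  }

  // Buggy: with an empty map, 0 - 1 wraps around to the largest size_t,
  // so an emptiness check based on this value reports "not empty".
  std::size_t BuggySize() const { return rep_.size() - 1; }

  // Fixed: handle the empty case before subtracting.
  std::size_t Size() const { return rep_.empty() ? 0 : rep_.size() - 1; }

  bool IsEmpty() const { return Size() == 0; }

 private:
  std::map<int, int> rep_;
};
```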

Also fixed an uninitialized variable in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336

Differential Revision: D9600080

Pulled By: ajkr

fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7
2018-08-31 12:28:52 -07:00
Mikhail Antonov
927f274939 Avoiding write stall caused by manual flushes (#4297)
Summary:
Basically, at the moment it seems possible to cause a write stall by calling flush (either manually via DB::Flush(), or from Backup Engine directly calling FlushMemTable()) while a background flush may already be happening.

One way to fix it: in DBImpl::CompactRange() we already check for a possible stall and delay the flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable.

This is a draft patch for a first look; need to check tests/update SyncPoints and most certainly would need to add an allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297

Differential Revision: D9420705

Pulled By: mikhail-antonov

fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
2018-08-29 12:12:55 -07:00
Andrew Kryczka
42733637e1 Sync CURRENT file during checkpoint (#4322)
Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it.

Differential Revision: D9525939

Pulled By: ajkr

fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3
2018-08-28 12:43:18 -07:00
Yanqin Jin
198459ce17 Fix an inaccurate comment (#4315)
Summary:
According to 4848bd0c4e/db/log_reader.cc (L355), the original text is misleading when describing the layout of RecyclableLogHeader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4315

Differential Revision: D9505284

Pulled By: riversand963

fbshipit-source-id: 79994c37a69e7003f03453e7efc0186feeafa609
2018-08-24 18:13:20 -07:00
Shrikanth Shankar
4848bd0c4e Drop unnecessary deletion markers during compaction (issue - 3842) (#4289)
Summary:
This PR fixes issue 3842. We drop deletion markers iff
1. We are the bottom most level AND
2. All other occurrences of the key are in the same snapshot range as the delete

I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far)

make check -j64
db_stress
db_stress  --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesn't break existing tests */
./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify new test code */
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289

Differential Revision: D9491165

Pulled By: shrikanthshankar

fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2
2018-08-24 15:17:54 -07:00
Yanqin Jin
7daae512d2 Refactor flush request queueing and processing (#3952)
Summary:
RocksDB currently queues individual column families for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.

This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.

This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952

Differential Revision: D8529933

Pulled By: riversand963

fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
2018-08-24 13:27:35 -07:00
Andrew Kryczka
17f9a181d5 Reduce empty SST creation/deletion during compaction (#4311)
Summary:
I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release.

This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311

Differential Revision: D9493734

Pulled By: ajkr

fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8
2018-08-24 12:27:57 -07:00
Andrew Kryczka
ee234e83e3 Invoke OnTableFileCreated for empty SSTs (#4307)
Summary:
The API comment on `OnTableFileCreationStarted` (b6280d01f9/include/rocksdb/listener.h (L331-L333)) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data.

This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307

Differential Revision: D9485201

Pulled By: ajkr

fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6
2018-08-23 18:27:30 -07:00
Gauresh Rane
ad789e4e0d Adding a method for memtable class for memtable getting flushed. (#4304)
Summary:
Memtables are selected for flushing by the flush job. Currently we
have a listener which is invoked when memtables for a column family are
flushed. That listener does not indicate which memtable was flushed in
the notification. If clients want to know whether particular data in the
memtable was retired, there is no straightforward way to know this.
This method will help users who implement a memtablerep factory and extend
the memtablerep interface to know whether the data in the memtable was
retired.
Another option that was tried was to depend on the memtable destructor
being called after flush to mark that data was persisted. This works all
the time, but sometimes there can be huge delays between the actual flush
happening and the memtable getting destroyed. Hence, anyone who is
waiting for data to persist will have to wait that much longer.
Anyone implementing this method is expected to return quickly, as it
blocks RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304

Reviewed By: riversand963

Differential Revision: D9472312

Pulled By: gdrane

fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d
2018-08-23 17:14:25 -07:00
Yanqin Jin
bb5dcea98e Add path to WritableFileWriter. (#4039)
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039

Differential Revision: D8670178

Pulled By: riversand963

fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
2018-08-23 10:12:58 -07:00
Zhongyi Xie
f1f5ba085f add missing counters in readonly mode (#4260)
Summary:
User reported (https://github.com/facebook/rocksdb/issues/4168) that when opening RocksDB in read-only mode, some statistics are not correctly reported. After some investigation, we believe the following counters are indeed not reported during Get() call in a read-only DB:
rocksdb.memtable.hit
rocksdb.memtable.miss
rocksdb.number.keys.read
rocksdb.bytes.read
As well as histogram rocksdb.bytes.per.read
and perf context get_read_bytes
This PR will add the necessary counter reporting logic in the Get() call path
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4260

Differential Revision: D9476431

Pulled By: miasantreble

fbshipit-source-id: 7ab409d4e59df05d09ae8b69fe75554e5aa240d6
2018-08-22 22:43:13 -07:00
Siying Dong
d5612b43de Two code changes to make "clang analyze" happy (#4292)
Summary:
Clang analyze is not happy with two pieces of code, reporting "Potential memory leak". No idea what the problem is, but slightly changing the code makes clang happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292

Differential Revision: D9413555

Pulled By: siying

fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137
2018-08-20 17:43:41 -07:00
Yanqin Jin
d116a1725d Update recovery code for version edits group commit. (#3945)
Summary:
During recovery, RocksDB is able to handle version edits that belong to group commits.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945

Differential Revision: D8529122

Pulled By: riversand963

fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e
2018-08-20 14:58:00 -07:00
Mikhail Antonov
889a0553c8 VerifyChecksum() API should preserve options
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4275

Reviewed By: yiwu-arbug

Differential Revision: D9369766

Pulled By: mikhail-antonov

fbshipit-source-id: d91b64c34cc1976b324a260767fce343fa32afde
2018-08-16 16:42:29 -07:00
Andrey Zagrebin
aeed4f0749 #3865 fix performance regression introduced by MergeOperator.ShouldMerge (#4266)
Summary:
This PR addresses issue #3865 and implements the following approach to fix it:
 - adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order
 - `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward`
 - pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266

Differential Revision: D9360750

Pulled By: sagar0

fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092
2018-08-16 10:58:05 -07:00
jsteemann
33ad9060d3 fix compilation with g++ option -Wsuggest-override (#4272)
Summary:
Fixes compilation warnings (which are turned into compilation errors by default) when compiling with g++ option `-Wsuggest-override`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4272

Differential Revision: D9322556

Pulled By: siying

fbshipit-source-id: abd57a29ec8f544bee77c0bb438f31be830b7244
2018-08-14 15:13:10 -07:00
Huachao Huang
d916a1105a c-api: add some missing options
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4267

Differential Revision: D9309505

Pulled By: anand1976

fbshipit-source-id: eb9fee8037f4ff24dc1cdd5cc5ef41c231a03e1f
2018-08-13 18:42:30 -07:00
Siying Dong
f3d91a0b57 Add a unit test to verify iterators release data blocks after using them (#4170)
Summary:
Add a unit test to check that iterators release data blocks after they have moved away from them. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170

Differential Revision: D8962513

Pulled By: siying

fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
2018-08-13 17:43:14 -07:00
Anand Ananthabhotla
4ea56b1bd0 Revert changes in PR #4003 (#4263)
Summary:
Revert this change. Not generating the OnTableFileCreated() notification for a 0 byte SST on flush breaks the assumption that every OnTableFileCreationStarted() notification is followed by a corresponding OnTableFileCreated().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4263

Differential Revision: D9285623

Pulled By: anand1976

fbshipit-source-id: 808c3dcd498b4b4f4ed4be947a29a24b2296aa8d
2018-08-11 16:57:36 -07:00
Zhichao Cao
6d75319d95 Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228)
Summary:
In the current trace and replay implementation, Get and WriteBatch are traced. This pull request adds tracing of Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are written to the file.

Replay of Iterator is not supported in the current implementation.

Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228

Differential Revision: D9201381

Pulled By: zhichao-cao

fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
2018-08-10 17:57:40 -07:00
Zhichao Cao
76d77205da Remove the redundant condition inclusion to avoid confusion (#4254)
Summary:
The pair of ROCKSDB_LITE conditional inclusions is redundant, since it is already inside the #ifndef ROCKSDB_LITE block. Remove them to avoid confusion.

Tested by make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4254

Differential Revision: D9281652

Pulled By: zhichao-cao

fbshipit-source-id: 06bf7641ede71391f21f6a3fe37fbd13f0e2a43a
2018-08-10 17:43:33 -07:00
Maysam Yabandeh
caf0f53a74 Index value delta encoding (#3983)
Summary:
Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding on the values. The first value at each index restart interval encodes the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by using format_version 4.
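
A simplified sketch of the scheme, under stated assumptions: it stores plain integers rather than the real varint layout, it stores the size as a difference from the previous handle's size, and it assumes blocks are laid out back to back with a fixed trailer so that offsets can be re-derived. `EncodeHandles`, `DecodeHandles`, and `kAssumedTrailerSize` are hypothetical names; see IndexBlockIter::DecodeCurrentValue for the actual format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BlockHandle {
  uint64_t offset;
  uint64_t size;
};

// Assumption for this sketch: each block is followed by a fixed-size trailer,
// so the next block starts at offset + size + trailer.
constexpr uint64_t kAssumedTrailerSize = 5;

// At restart points emit [offset, size]; otherwise emit only the size delta.
// restart_interval must be greater than zero.
std::vector<int64_t> EncodeHandles(const std::vector<BlockHandle>& handles,
                                   std::size_t restart_interval) {
  std::vector<int64_t> out;
  for (std::size_t i = 0; i < handles.size(); ++i) {
    if (i % restart_interval == 0) {
      out.push_back(static_cast<int64_t>(handles[i].offset));
      out.push_back(static_cast<int64_t>(handles[i].size));
    } else {
      out.push_back(static_cast<int64_t>(handles[i].size) -
                    static_cast<int64_t>(handles[i - 1].size));
    }
  }
  return out;
}

// Rebuild the handles: offsets between restart points are derived from the
// previous handle, sizes from the stored deltas.
std::vector<BlockHandle> DecodeHandles(const std::vector<int64_t>& encoded,
                                       std::size_t count,
                                       std::size_t restart_interval) {
  std::vector<BlockHandle> handles;
  std::size_t pos = 0;
  for (std::size_t i = 0; i < count; ++i) {
    BlockHandle h;
    if (i % restart_interval == 0) {
      h.offset = static_cast<uint64_t>(encoded[pos++]);
      h.size = static_cast<uint64_t>(encoded[pos++]);
    } else {
      const BlockHandle& prev = handles.back();
      h.offset = prev.offset + prev.size + kAssumedTrailerSize;
      h.size = static_cast<uint64_t>(static_cast<int64_t>(prev.size) +
                                     encoded[pos++]);
    }
    handles.push_back(h);
  }
  return handles;
}
```

With a restart interval of 16, most entries carry a single small number instead of two large ones, which is where the index size reduction comes from.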

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
2018-08-09 16:58:40 -07:00
Yanqin Jin
de7f423a82 Add SST ingestion to ldb (#4205)
Summary:
We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205

Differential Revision: D9112711

Pulled By: riversand963

fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4
2018-08-09 14:29:11 -07:00
Zhongyi Xie
b15379dcea fix use-after-free error involving a temporary string (#4240)
Summary:
In the current code, `error_msg` is pointing to the inner buffer of a temporary std::string object. When `error_msg` is used to construct the error message, that array is already released. This PR will fix this bug by copying the string to a local variable.
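
A minimal sketch of the bug class and the fix. `DescribeError` and `BuildMessage` are hypothetical helpers, not the code touched by this PR.

```cpp
#include <string>

// Hypothetical helper that returns a temporary std::string.
std::string DescribeError(int code) {
  return "error code " + std::to_string(code);
}

// Buggy pattern (shown only as a comment, since using it is undefined):
//   const char* error_msg = DescribeError(code).c_str();
//   // The temporary is destroyed at the end of that statement, so
//   // error_msg now points at a released buffer.

// Fixed pattern: copy the string into a local so its buffer outlives the use.
std::string BuildMessage(int code) {
  const std::string error_msg = DescribeError(code);
  return std::string("Operation failed: ") + error_msg.c_str();
}
```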
Fixes https://github.com/facebook/rocksdb/issues/4239
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4240

Differential Revision: D9204334

Pulled By: miasantreble

fbshipit-source-id: 0ac599e166ae0a4ec413e32d8b8853d7c5fba878
2018-08-09 11:13:10 -07:00
Maysam Yabandeh
d8d66c937e Simplify DBWithMaxSpaceAllowedRandomized (#4235)
Summary:
The test has become complicated over the years, and it is hard to reason about the corner cases that make the test flaky. The patch simplifies the test and also fixes some probable synchronization issues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4235

Differential Revision: D9187995

Pulled By: maysamyabandeh

fbshipit-source-id: 53c7b060f14367e5a9e361014578c26debfe3d27
2018-08-08 07:27:46 -07:00
Huachao Huang
badfd70a3e types: add kEntryBlobIndex for TablePropertiesCollector (#4233)
Summary:
So that we can act accordingly on blob index entries
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233

Differential Revision: D9190205

Pulled By: yiwu-arbug

fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c
2018-08-06 18:27:44 -07:00
Yi Wu
4cb7068c1e BlobDB: Fix VisibleToActiveSnapshot() (#4236)
Summary:
There are two issues with `VisibleToActiveSnapshot`:
1. If there are no snapshots, `oldest_snapshot` will be 0 and `VisibleToActiveSnapshot` will always return true. Since the method is used to decide whether it is safe to delete obsolete files, obsolete files cannot be deleted in this case.
2. The `auto` keyword in `auto snapshots = db_impl_->snapshots()` deduces a copy of `const SnapshotList` instead of a reference. Since the copy constructor of `SnapshotList` is not defined, using the copy may yield unexpected results.
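
The deduction rule behind the second issue can be sketched as follows: plain `auto` drops the reference from an accessor's return type and makes a copy, while `const auto&` binds to the original object. `Counted` and `Owner` are hypothetical types used only to count copies.

```cpp
// Hypothetical type that counts how often it is copy-constructed.
struct Counted {
  static int copies;
  Counted() = default;
  Counted(const Counted&) { ++copies; }
};
int Counted::copies = 0;

// Hypothetical owner exposing its member by const reference.
struct Owner {
  Counted list;
  const Counted& items() const { return list; }
};

// Plain auto deduces Counted, so the accessor's result is copied.
int CopiesWithAuto(const Owner& owner) {
  const int before = Counted::copies;
  auto copy = owner.items();
  (void)copy;
  return Counted::copies - before;
}

// const auto& binds to the member itself; no copy is made.
int CopiesWithAutoRef(const Owner& owner) {
  const int before = Counted::copies;
  const auto& ref = owner.items();
  (void)ref;
  return Counted::copies - before;
}
```

For a type whose implicit copy is not meaningful, such as an intrusive list, the accidental copy can behave incorrectly rather than just being slow.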

Issue 2 actually hid issue 1 from being caught by tests. During tests, `snapshots.empty()` can return false when the list should actually be empty, and `snapshots.oldest()` returns an invalid address, making `oldest_snapshot` some random large number.

The issue was originally reported by BlobDB early adopter at Kuaishou.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4236

Differential Revision: D9188706

Pulled By: yiwu-arbug

fbshipit-source-id: a0f2624b927cf9bf28c1bb534784fee5d106f5ea
2018-08-06 16:57:42 -07:00
Jingguo Yao
ceb5fea1e3 Improve FullFilterBitsReader::HashMayMatch's doc (#4202)
Summary:
HashMayMatch is related to AddKey() instead of CreateFilter().
Also applies some minor Fixes #4191 #4200 #3910
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202

Differential Revision: D9180945

Pulled By: maysamyabandeh

fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
2018-08-06 11:13:18 -07:00
Yanqin Jin
1f802773bc Update JobContext. (#3949)
Summary:
In the past, we assumed that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`.
Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature.

Add a vector of `SuperVersionContext` to `JobContext` to support installing
superversions for multiple column families in one job context.

This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949

Differential Revision: D8864895

Pulled By: riversand963

fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca
2018-08-03 17:42:34 -07:00
Yanqin Jin
22368965a0 Modify verification logic of ObsoleteOptionsFileTest (#4218)
Summary:
The current verification logic does not consider the case in which multiple
threads (foreground and background) may execute `PurgeObsoleteFiles` function
simultaneously. Each invocation will trigger the callback adding elements to
a vector. Then we verify the elements in the vector, which can fail sometimes.

The solution is to give up checking the elements. Instead, we check the number
of OPTIONS files in the database dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4218

Differential Revision: D9128727

Pulled By: riversand963

fbshipit-source-id: 2b13b705fb21bc0ddd41940c4ec9b6b0c8d88224
2018-08-03 13:57:40 -07:00
DorianZheng
f9373e2d5c Make sure to call ReleaseFileNumberFromPendingOutputs
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4219

Differential Revision: D9144294

Pulled By: riversand963

fbshipit-source-id: e46b72e5f8a149dc7a0512e38edcd0ddb0150f30
2018-08-02 18:57:34 -07:00
Andrew Kryczka
f8f6983f89 Skip range deletions at seqno zero when collapsing (#4216)
Summary:
`CollapsedRangeDelMap` internally uses seqno zero as a sentinel value to
denote a gap between range tombstones or the end of range tombstones. It
therefore expects to never have consecutive sentinel tombstones.

However, since `DeleteRange` is now supported in `SstFileWriter`, an
ingested file may contain range tombstones, and that ingested file may
be assigned global seqno zero. When such tombstones are added to the
collapsed map, they resemble sentinel tombstones due to having seqno
zero. Then, the invariant mentioned above about never having consecutive
sentinel tombstones can be violated.

The symptom of this violation was dereferencing the `end()` iterator
(#4204). The fix in this PR is to not add range tombstones with seqno
zero to the collapsed map. They're not needed anyways since they can't
possibly cover anything (in case of a key and a range tombstone with the
same seqno, the key is visible).
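
The visibility rule in the parenthetical can be sketched as below: a tombstone hides a key only if the key falls in the range and has a strictly smaller seqno, so a tombstone at seqno zero can never hide anything. The types and function names are hypothetical, and integer keys with an exclusive end key are simplifications for illustration.

```cpp
#include <cstdint>

using SequenceNumber = uint64_t;

// Hypothetical simplified tombstone over integer keys, covering [start, end).
struct RangeTombstone {
  int start;
  int end;
  SequenceNumber seq;
};

// A key is covered only when it lies in the range and is strictly older than
// the tombstone. With equal seqnos the key stays visible.
bool Covers(const RangeTombstone& t, int key, SequenceNumber key_seq) {
  return key >= t.start && key < t.end && key_seq < t.seq;
}

// Tombstones at seqno zero cannot cover any key (no key seqno is below zero),
// so they can be skipped instead of being confused with sentinel entries.
bool ShouldAddToCollapsedMap(const RangeTombstone& t) { return t.seq != 0; }
```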
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4216

Differential Revision: D9121716

Pulled By: ajkr

fbshipit-source-id: f5b78a70bea9527354603ea7ac8542a7e2b6a210
2018-08-01 12:12:02 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Yanqin Jin
54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
DorianZheng
f5e46354d2 Protect external file when ingesting (#4099)
Summary:
If a crash happens after a hard link is established, the Recover function may reuse the file number that has already been assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never be reused.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099

Differential Revision: D9034092

Pulled By: riversand963

fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28
2018-07-27 14:13:12 -07:00
Siying Dong
fd45495cf5 DBImpl::IngestExternalFile() should grab mutex when releasing file number in failure case (#4189)
Summary:
995fcf7573 has a bug: the ReleaseFileNumberFromPendingOutputs() call it added is not protected by the DB mutex. Fix it by grabbing the lock for this operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4189

Differential Revision: D9015447

Pulled By: siying

fbshipit-source-id: b8506e09a96c3f95a6fe32b5ca5fcdb9bee88937
2018-07-26 11:12:29 -07:00
Siying Dong
2a81633da2 Fix bug when seeking backward against an out-of-bound iterator (#4187)
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in the backward direction. This caused the iterator to be invalid by mistake. Fix it by always resetting the out-of-bound flag in every seek.
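
A minimal sketch of the stale-flag problem described above, using a toy iterator rather than the real BlockBasedTableIterator. The class `BoundedIter` and its members are hypothetical names introduced only for illustration; the point is that both seek directions clear the out-of-bound flag before repositioning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy iterator over sorted keys with an exclusive upper bound.
// Hypothetical stand-in; not RocksDB code.
class BoundedIter {
 public:
  BoundedIter(std::vector<std::string> keys, std::string upper_bound)
      : keys_(std::move(keys)), upper_bound_(std::move(upper_bound)) {
    assert(std::is_sorted(keys_.begin(), keys_.end()));
    pos_ = keys_.size();  // start unpositioned
  }

  // Forward seek: position at the first key >= target.
  void Seek(const std::string& target) {
    out_of_bound_ = false;  // reset on every seek
    pos_ = static_cast<std::size_t>(
        std::lower_bound(keys_.begin(), keys_.end(), target) - keys_.begin());
    if (pos_ < keys_.size() && keys_[pos_] >= upper_bound_) {
      out_of_bound_ = true;  // landed past the upper bound
    }
  }

  // Backward seek: position at the last key <= target.
  void SeekForPrev(const std::string& target) {
    out_of_bound_ = false;  // without this reset, a stale flag would survive
    auto it = std::upper_bound(keys_.begin(), keys_.end(), target);
    pos_ = (it == keys_.begin())
               ? keys_.size()
               : static_cast<std::size_t>(it - keys_.begin()) - 1;
  }

  bool Valid() const { return !out_of_bound_ && pos_ < keys_.size(); }
  const std::string& key() const { return keys_[pos_]; }

 private:
  std::vector<std::string> keys_;
  std::string upper_bound_;
  std::size_t pos_ = 0;
  bool out_of_bound_ = false;
};
```

With keys {a, c, z} and upper bound "x", a forward Seek("d") lands on "z" and sets the flag; a following SeekForPrev("d") should land on "c" and be valid, which only holds if the flag is cleared first.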
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187

Differential Revision: D8996600

Pulled By: siying

fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
2018-07-25 17:14:01 -07:00
Manuel Ung
ea212e5316 WriteUnPrepared: Implement unprepared batches for transactions (#4104)
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
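
The size check described above can be sketched with a toy transaction. `ToyTransaction` and its members are hypothetical; treating a threshold of 0 as "no limit" is an assumption of this sketch, and the "write to DB" step is reduced to a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy transaction that flushes its write batch as an "unprepared batch"
// once the accumulated size exceeds a threshold. Hypothetical names only.
class ToyTransaction {
 public:
  explicit ToyTransaction(std::size_t max_write_batch_size)
      : max_write_batch_size_(max_write_batch_size) {}

  void Put(const std::string& key, const std::string& value) {
    MaybeFlushUnprepared();  // check the threshold before modifying data
    batch_.push_back(key + "=" + value);
    batch_bytes_ += key.size() + value.size();
  }

  std::size_t unprepared_batches() const { return unprepared_batches_; }
  std::size_t pending_entries() const { return batch_.size(); }

 private:
  void MaybeFlushUnprepared() {
    // Assumption of this sketch: 0 means "no limit".
    if (max_write_batch_size_ == 0 || batch_bytes_ <= max_write_batch_size_) {
      return;
    }
    assert(!batch_.empty());
    ++unprepared_batches_;  // stands in for writing the batch to the DB
    batch_.clear();
    batch_bytes_ = 0;
  }

  std::size_t max_write_batch_size_;
  std::vector<std::string> batch_;
  std::size_t batch_bytes_ = 0;
  std::size_t unprepared_batches_ = 0;
};
```

With a threshold of 8 bytes, two 6-byte entries accumulate to 12 bytes, so the third Put flushes the first two as one unprepared batch before adding its own entry.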

Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.

A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104

Differential Revision: D8785717

Pulled By: lth

fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
2018-07-24 00:13:18 -07:00
Zhongyi Xie
f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` values, which adds unnecessary cost to overall CPU usage. We can optimize by storing only the ticker values that are used in `Get()` calls in a new struct `GetContextStats`, since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values, while the old approach requires visiting 100+ `Ticker` values.
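
A rough sketch of the idea, with made-up tickers: the enum, the three fields, and the slot count of 128 are assumptions for illustration and do not mirror the real `Ticker` list or the real `GetContextStats` layout. Both reporting functions return how many values they had to visit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical ticker ids; kTickerMax stands in for "100+ tickers".
enum Ticker { kBlockCacheHit = 0, kBlockCacheMiss = 1, kBytesRead = 2, kTickerMax = 128 };

struct Statistics {
  std::array<std::uint64_t, kTickerMax> tickers{};
  void Record(Ticker t, std::uint64_t count) {
    assert(t < kTickerMax);
    tickers[t] += count;
  }
};

// Only the counters a point lookup can touch (three in this toy version).
struct GetContextStats {
  std::uint64_t block_cache_hit = 0;
  std::uint64_t block_cache_miss = 0;
  std::uint64_t bytes_read = 0;
};

// Old approach: walk every ticker slot, even though most are zero.
std::size_t ReportAllTickers(const std::array<std::uint64_t, kTickerMax>& local,
                             Statistics* stats) {
  std::size_t visited = 0;
  for (std::size_t i = 0; i < local.size(); ++i) {
    if (local[i] > 0) stats->Record(static_cast<Ticker>(i), local[i]);
    ++visited;
  }
  return visited;
}

// New approach: visit only the fields of the small struct.
std::size_t ReportGetContextStats(const GetContextStats& s, Statistics* stats) {
  if (s.block_cache_hit > 0) stats->Record(kBlockCacheHit, s.block_cache_hit);
  if (s.block_cache_miss > 0) stats->Record(kBlockCacheMiss, s.block_cache_miss);
  if (s.bytes_read > 0) stats->Record(kBytesRead, s.bytes_read);
  return 3;
}
```

Both paths leave the global statistics in the same state; the second simply touches far fewer values per `Get()`.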
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
Siying Dong
a5e851e113 Reformatting some recent changes (#4161)
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161

Differential Revision: D8940582

Pulled By: siying

fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
2018-07-20 14:43:38 -07:00
Siying Dong
8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may waste reads in some use cases, while not being efficient in others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
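
One way such a heuristic could look, as a sketch only. The assumptions here are that the tracked quantity is the tail size each recent open actually needed, and that "wasted" means bytes prefetched beyond that need; the class `TailSizeTracker` and the exact selection rule are hypothetical and are not taken from the RocksDB source.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps the 32 most recent tail sizes and suggests a prefetch size.
class TailSizeTracker {
 public:
  static constexpr std::size_t kNumRecords = 32;

  void Record(std::size_t tail_size) {
    records_[next_] = tail_size;
    next_ = (next_ + 1) % kNumRecords;  // ring buffer: overwrite the oldest
    if (count_ < kNumRecords) ++count_;
  }

  // Largest recorded size such that, had every recorded read prefetched
  // that many bytes, the wasted bytes stay under 10% of the bytes read.
  std::size_t SuggestPrefetchSize() const {
    assert(count_ <= kNumRecords);
    std::vector<std::size_t> sorted(records_.begin(), records_.begin() + count_);
    std::sort(sorted.begin(), sorted.end());
    std::size_t best = 0;
    std::size_t wasted = 0;
    for (std::size_t i = 0; i < sorted.size(); ++i) {
      // Each of the i smaller records wastes the extra step up to sorted[i].
      if (i > 0) wasted += (sorted[i] - sorted[i - 1]) * i;
      const std::size_t total = sorted[i] * sorted.size();
      if (wasted * 10 < total) best = sorted[i];
    }
    return best;
  }

 private:
  std::array<std::size_t, kNumRecords> records_{};
  std::size_t next_ = 0;
  std::size_t count_ = 0;
};
```

With ten records of 1000 bytes and one outlier of 100000 bytes, sizing for the outlier would waste most of every other read, so the sketch settles on 1000.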
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Yanqin Jin
2736752b33 Fix a bug in MANIFEST group commit (#4157)
Summary:
PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
implementation has a bug. When updating the log file number of each column
family, we must consider only `VersionEdit`s that operate on the same column
family. Otherwise, a column family may accidentally set its log file number
higher than actual value, indicating that log files with smaller file number
will be ignored, thus causing some updates to be lost.
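
The difference can be illustrated with a simplified model. `ToyVersionEdit` and both functions are hypothetical, and the "mixed" variant is only one plausible shape of the mistake, not the original code.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Simplified edit: which column family it belongs to and the log number it sets.
struct ToyVersionEdit {
  std::uint32_t cf_id;
  std::uint64_t log_number;
};

// Mixed-up variant: every column family gets the maximum over the whole batch.
std::map<std::uint32_t, std::uint64_t> LogNumbersAcrossBatch(
    const std::vector<ToyVersionEdit>& batch) {
  std::uint64_t max_log = 0;
  for (const ToyVersionEdit& e : batch) max_log = std::max(max_log, e.log_number);
  std::map<std::uint32_t, std::uint64_t> result;
  for (const ToyVersionEdit& e : batch) result[e.cf_id] = max_log;
  return result;
}

// Intended behaviour: only edits of the same column family are considered.
std::map<std::uint32_t, std::uint64_t> LogNumbersPerColumnFamily(
    const std::vector<ToyVersionEdit>& batch) {
  std::map<std::uint32_t, std::uint64_t> result;
  for (const ToyVersionEdit& e : batch) {
    std::uint64_t& current = result[e.cf_id];
    current = std::max(current, e.log_number);
  }
  return result;
}
```

For a batch with (cf 0, log 5) and (cf 1, log 9), the mixed-up variant reports log 9 for column family 0, so logs 5 through 8 would be skipped for it on recovery; the per-column-family variant keeps 5.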
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157

Differential Revision: D8916650

Pulled By: riversand963

fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
2018-07-19 17:27:56 -07:00
Dmitri Smirnov
78ab11cd71 Return new operator for Status allocations for Windows (#4128)
Summary: Windows requires new/delete for memory allocations to be overridden. Refactor to be less intrusive.

Differential Revision: D8878047

Pulled By: siying

fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36
2018-07-19 15:09:06 -07:00
Sagar Vemuri
f3801528c1 Disable DBFlushTest.SyncFail and DBTest.GroupCommitTest on Travis (#4154)
Summary:
I am temporarily disabling DBFlushTest.SyncFail and DBTest.GroupCommitTest tests on Travis until we figure out the root-cause. These tests will still continue to run locally though.
I haven't been able to reproduce these failures locally so far (even on a [local Travis environment](https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image) ).

These tests are failing way too frequently, causing everyone to wonder why their PR failed on Travis and to waste time debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4154

Differential Revision: D8907258

Pulled By: sagar0

fbshipit-source-id: f40068b16e9245fb3791b6a4796435d1ce1ed205
2018-07-18 18:43:11 -07:00
Siying Dong
37e0fdc824 DBSSTTest.DeleteSchedulerMultipleDBPaths data race (#4146)
Summary:
Fix a minor data race in DBSSTTest.DeleteSchedulerMultipleDBPaths reported by TSAN
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4146

Differential Revision: D8880945

Pulled By: siying

fbshipit-source-id: 25c632f685757735c59ad4ff26b2f346a443a446
2018-07-17 17:57:46 -07:00
Yi Wu
d538ebdff0 Fix write get stuck when pipelined write is enabled (#4143)
Summary:
Fix an issue where, when pipelined write is enabled, writers can get stuck indefinitely and be unable to finish the write. It can be shown with the following example: assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).

T1. All writers are pending in the WAL writer queue:
WAL writer queue: W1, W2, W3, W4
memtable writer queue: empty

T2. W1 finishes its WAL write and moves to the memtable writer queue:
WAL writer queue: W2, W3, W4
memtable writer queue: W1

T3. W2 and W3 finish the WAL write as a batch group. W2 enters ExitAsBatchGroupLeader and moves the group to the memtable writer queue, but has not yet woken up the next leader.
WAL writer queue: W4
memtable writer queue: W1, W2, W3

T4. W1, W2, W3 finish the memtable write as a batch group. Note that W2 is still in the previous ExitAsBatchGroupLeader, although W1 has done the memtable write for W2.
WAL writer queue: W4
memtable writer queue: empty

T5. The thread corresponding to W3 creates another writer W3' with the same address as W3.
WAL writer queue: W4, W3'
memtable writer queue: empty

T6. W2 continues with ExitAsBatchGroupLeader. Because the address of W3' is the same as that of W3, the last writer in its group, it thinks there are no pending writers, so it resets newest_writer_ to null, emptying the queue. W4 and W3' are removed from the queue and will never be woken up.

The issue has existed since pipelined write was introduced in 5.5.0.
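
The address-reuse hazard in steps T3 to T6 can be replayed in a few lines. This is a toy model under stated assumptions: `ToyWriter`, `ToyWriterQueue`, and `ExitGroupByAddress` are hypothetical, the queue is single-threaded here, and the sketch shows only the faulty comparison, not the fix that was committed.

```cpp
// Toy intrusive queue: newest points at the most recently queued writer.
struct ToyWriter {
  ToyWriter* link_older = nullptr;
};

struct ToyWriterQueue {
  ToyWriter* newest = nullptr;
  void Push(ToyWriter* w) {
    w->link_older = newest;
    newest = w;
  }
};

// Leader exit step: if the newest writer is still the last writer of the
// leader's own group, assume nothing else is queued and clear the queue.
// Returns true if the queue was cleared.
bool ExitGroupByAddress(ToyWriterQueue* queue, const ToyWriter* last_in_group) {
  if (queue->newest == last_in_group) {
    queue->newest = nullptr;
    return true;
  }
  return false;
}

// Replays T3..T6: returns true if W4 and W3' were dropped from the queue.
bool ReplayLostWriters() {
  ToyWriterQueue wal_queue;
  ToyWriter w4;
  ToyWriter slot;                          // storage of W3, later reused by W3'
  const ToyWriter* last_in_group = &slot;  // leader W2 remembers W3 by address
  wal_queue.Push(&w4);                     // T3: only W4 is still pending
  slot = ToyWriter();                      // T5: W3' is built at W3's address
  wal_queue.Push(&slot);                   // WAL writer queue: W4, W3'
  // T6: the stale pointer compares equal to W3', so the queue is emptied.
  return ExitGroupByAddress(&wal_queue, last_in_group) &&
         wal_queue.newest == nullptr;
}
```

Comparing by address cannot tell W3 from W3', which is why the leader wrongly concludes that no writers arrived after its group.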

Closes #3704
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143

Differential Revision: D8871599

Pulled By: yiwu-arbug

fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a
2018-07-17 17:27:51 -07:00
Siying Dong
ddc07b40fc Remove managed iterator
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124

Differential Revision: D8829910

Pulled By: siying

fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560
2018-07-17 14:43:18 -07:00
Siying Dong
995fcf7573 Pending output file number should be released after bulkload failure (#4145)
Summary:
If bulkload fails because of an input error, the pending output file number wasn't released. This bug can prevent all future files with a larger number than the current number from being deleted, even after they are compacted. This commit fixes the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4145

Differential Revision: D8877900

Pulled By: siying

fbshipit-source-id: 080be92a23d43305ca1e13fe1c06eb4cd0b01466
2018-07-17 14:13:16 -07:00
Maysam Yabandeh
b55da012f6 Refactor IndexBlockIter (#4141)
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141

Differential Revision: D8866437

Pulled By: maysamyabandeh

fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
2018-07-16 17:13:10 -07:00
Sagar Vemuri
991120fa10 Allow ttl to be changed dynamically (#4133)
Summary:
Allow ttl to be changed dynamically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133

Differential Revision: D8845440

Pulled By: sagar0

fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3
2018-07-16 14:27:53 -07:00
Nathan VanBenschoten
ef7815b803 Support range deletion tombstones in IngestExternalFile SSTs (#3778)
Summary:
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
2018-07-13 22:43:09 -07:00
Peter Mattis
90fc40690a Relax VersionStorageInfo::GetOverlappingInputs check (#4050)
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is set to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.
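
A small sketch of why the two comparisons disagree. Internal keys order by user key ascending and then by sequence number descending; the type `ToyInternalKey`, the helper names, and the use of the maximum 64-bit value as a stand-in for kMaxSequenceNumber are assumptions of this example.

```cpp
#include <cstdint>
#include <limits>
#include <string>

struct ToyInternalKey {
  std::string user_key;
  std::uint64_t seq;
};

// Stand-in for kMaxSequenceNumber (the real constant has a different value).
const std::uint64_t kToyMaxSequenceNumber =
    std::numeric_limits<std::uint64_t>::max();

// Internal-key order: user key ascending, then sequence number descending.
bool InternalLess(const ToyInternalKey& a, const ToyInternalKey& b) {
  if (a.user_key != b.user_key) return a.user_key < b.user_key;
  return a.seq > b.seq;
}

// Overlap test on user keys only, for two adjacent files in a level.
bool AdjacentOverlapByUserKey(const ToyInternalKey& left_largest,
                              const ToyInternalKey& right_smallest) {
  return left_largest.user_key >= right_smallest.user_key;
}

// Overlap test in internal-key space.
bool AdjacentOverlapByInternalKey(const ToyInternalKey& left_largest,
                                  const ToyInternalKey& right_smallest) {
  return !InternalLess(left_largest, right_smallest);
}
```

If the left file ends at the sentinel ("c", max sequence number) and the right file starts at ("c", 7), the user-key test reports an overlap while the internal-key test does not, because the sentinel sorts strictly before any real entry for "c".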

Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and causes
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050

Differential Revision: D8844423

Pulled By: ajkr

fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
2018-07-13 17:42:38 -07:00
Yanqin Jin
21171615c1 Reduce execution time of IngestFileWithGlobalSeqnoRandomized (#4131)
Summary:
Make `ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized` run faster.

`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4131

Differential Revision: D8839952

Pulled By: riversand963

fbshipit-source-id: 4a7e842fde1cde4dc902e928a1cf511322578521
2018-07-13 17:27:39 -07:00
Maysam Yabandeh
8581a93a6b Per-thread unique test db names (#4135)
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: `~/gtest-parallel/gtest-parallel ./table_test`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135

Differential Revision: D8846653

Pulled By: maysamyabandeh

fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
2018-07-13 17:27:39 -07:00
Fosco Marotto
8527012bb6 Converted db/merge_test.cc to use gtest (#4114)
Summary:
Picked up a task to convert this to use the gtest framework.  It can't be this simple, can it?

It works, but should all the std::cout be removed?

```
[$] ~/git/rocksdb [gft !]: ./merge_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MergeTest
[ RUN      ] MergeTest.MergeDbTest
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Test merge-operator not set after reopen
[       OK ] MergeTest.MergeDbTest (93 ms)
[ RUN      ] MergeTest.MergeDbTtlTest
Opening database with TTL
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
Opening database with TTL
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Opening database with TTL
Opening database with TTL
Opening database with TTL
Opening database with TTL
Test merge-operator not set after reopen
[       OK ] MergeTest.MergeDbTtlTest (97 ms)
[----------] 2 tests from MergeTest (190 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (190 ms total)
[  PASSED  ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114

Differential Revision: D8822886

Pulled By: gfosco

fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a
2018-07-13 14:13:07 -07:00
Anand Ananthabhotla
e3eba52a5d Re-enable kUniversalSubcompactions option_config (#4125)
Summary:
1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so
tests that cycle through all the option_configs include this
2. Skip kUniversalSubcompactions wherever kUniversalCompaction and
kUniversalCompactionMultilevel are skipped

Related to #3935
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125

Differential Revision: D8828637

Pulled By: anand1976

fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f
2018-07-13 11:13:01 -07:00
Tamir Duberstein
7bee48bdbd Add GCC 8 to Travis (#3433)
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433

Differential Revision: D6837948

Pulled By: sagar0

fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
2018-07-13 10:58:06 -07:00
Yanqin Jin
90ebf1a257 Reduce execution time of a test. (#4127)
Summary:
Reduce the number of key ranges in `ExternalSSTFileTest.OverlappingRanges` so
that the test completes in shorter time to avoid timeouts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4127

Differential Revision: D8827851

Pulled By: riversand963

fbshipit-source-id: a16387b0cc92a7c872b1c50f0cfbadc463afc9db
2018-07-12 17:42:03 -07:00
Yanqin Jin
dbeaa0d397 Reduce #iterations to shorten execution time. (#4123)
Summary:
Reduce #iterations from 5000 to 1000 so that
`ExternalSSTFileTest.CompactDuringAddFileRandom` can finish faster.
On the one hand, 5000 iterations does not seem to improve the quality of unit
test in comparison with 1000. On the other hand, long running tests should belong to stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4123

Differential Revision: D8822514

Pulled By: riversand963

fbshipit-source-id: 0f439b8d5ccd9a4aed84638f8bac16382de17245
2018-07-12 14:42:39 -07:00
Nikhil Benesch
5f3088d565 Range deletion performance improvements + cleanup (#4014)
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014

Differential Revision: D8773253

Pulled By: ajkr

fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
2018-07-12 14:42:39 -07:00
Nikhil Benesch
5cd8240b86 Test range deletions with more configurations (#4021)
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.

Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.

/cc ajkr nvanbenschoten

Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021

Differential Revision: D8811860

Pulled By: ajkr

fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
2018-07-11 15:57:49 -07:00
Yanqin Jin
331cb63641 SetOptions Backup Race Condition (#4108)
Summary:
Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below.
```
Time                  thread 1                           thread 2
  |   CreateNewBackup -> GetLiveFiles
  |                                         SetOptions -> RenameTempFileToOptionsFile
  |                                         SetOptions -> RenameTempFileToOptionsFile
  |                                         SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file
  |   copy the oldest OPTIONS // IO error!
  V
```
Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`.
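
The proposed check can be sketched with a toy file manager. `ToyOptionsFiles` is hypothetical, the deletion counter only loosely models `disable_obsolete_files_deletion_`, and keeping the two newest OPTIONS files is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>

// Assumption of this sketch: the two newest OPTIONS files are kept.
const std::size_t kToyOptionsFilesKept = 2;

class ToyOptionsFiles {
 public:
  void DisableFileDeletions() { ++disable_count_; }
  void EnableFileDeletions() {
    if (disable_count_ > 0) --disable_count_;
  }

  // Called at the end of a SetOptions-like operation.
  void PersistNewOptionsFile(const std::string& name) {
    files_.push_back(name);
    // The fix: skip cleanup while deletions are disabled (e.g. during backup).
    if (disable_count_ == 0) DeleteObsoleteOptionsFiles();
  }

  bool Exists(const std::string& name) const {
    return std::find(files_.begin(), files_.end(), name) != files_.end();
  }
  std::size_t Count() const { return files_.size(); }

 private:
  void DeleteObsoleteOptionsFiles() {
    while (files_.size() > kToyOptionsFilesKept) files_.pop_front();
  }

  std::deque<std::string> files_;
  int disable_count_ = 0;
};
```

While a backup holds deletions disabled, repeated option changes no longer unlink the file the backup is about to copy; cleanup catches up on the next option change after deletions are re-enabled.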
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108

Differential Revision: D8796360

Pulled By: riversand963

fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82
2018-07-11 14:57:46 -07:00
Manuel Ung
b9846370e9 WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078)
Summary:
This adds support for recovering WriteUnprepared transactions through the following changes:
- The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
- `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
- `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.

Commit/Rollback of live transactions is still unimplemented and will come later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078

Differential Revision: D8703382

Pulled By: lth

fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
2018-07-06 17:59:13 -07:00
Yanqin Jin
db7ae0a485 Fix a map lookup that may throw exception. (#4098)
Summary:
`std::map::at(key)` throws std::out_of_range if key does not exist. Current
code does not handle this. Although this case is unlikely, I feel it's safe to
use `std::map::find`.
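
For reference, a short standalone comparison of the two lookups. The helper names are made up for this example and the map contents are arbitrary.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Lookup via find(): a missing key is reported as nullptr, no exception.
const int* FindOrNull(const std::map<std::string, int>& m,
                      const std::string& key) {
  auto it = m.find(key);
  return it == m.end() ? nullptr : &it->second;
}

// Lookup via at(): a missing key throws std::out_of_range.
bool AtThrowsOnMissingKey(const std::map<std::string, int>& m,
                          const std::string& key) {
  try {
    (void)m.at(key);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

The `find` variant lets the caller handle the missing-key case with an ordinary branch instead of an exception handler.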
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098

Differential Revision: D8753865

Pulled By: riversand963

fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec
2018-07-06 16:12:49 -07:00
Huachao Huang
35b83327a7 compaction: fix max_subcompactions option for CompactRange (#4082)
Summary:
The max_subcompactions option was introduced in https://github.com/facebook/rocksdb/pull/3775.
Closes https://github.com/facebook/rocksdb/pull/4082

Differential Revision: D8743258

Pulled By: ajkr

fbshipit-source-id: d60ee75769dfc19ab6f8754e4ff3a267848f1ed9
2018-07-05 20:12:56 -07:00
Zhongyi Xie
b3efb1cbe0 fix clang analyzer warnings (#4072)
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
    } else if (meta->smallest.size() > 0) {
               ^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
    meta->marked_for_compaction = sub_compact->builder->NeedCompact();
    ~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
        uint32_t cf_id = last_writer->cfd->GetID();
                         ^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072

Differential Revision: D8685852

Pulled By: miasantreble

fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
2018-06-28 19:12:35 -07:00
Manuel Ung
8ad63a4b86 WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069)
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.

Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.

Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069

Differential Revision: D8675099

Pulled By: lth

fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
2018-06-28 18:58:29 -07:00
Anand Ananthabhotla
52d4c9b7f6 Allow DB resume after background errors (#3997)
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
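
A toy model of the three parts listed above. The enum values, `ToyErrorHandler`, and the rule that everything short of a fatal error can be cleared are assumptions made for illustration; the real severity levels and recovery rules are defined by the PR itself.

```cpp
// Hypothetical severity scale, ordered from harmless to unrecoverable.
enum class ToySeverity { kNoError = 0, kSoftError, kHardError, kFatalError };

class ToyErrorHandler {
 public:
  // Single place where the background error and its severity are recorded.
  void SetBGError(ToySeverity severity) {
    if (severity > bg_error_) bg_error_ = severity;  // keep the worst one
  }

  // Writes fail while a background error is set.
  bool WritesAllowed() const { return bg_error_ == ToySeverity::kNoError; }

  // User-facing call to clear the error; refuses for fatal errors.
  bool Resume() {
    if (bg_error_ == ToySeverity::kFatalError) return false;
    bg_error_ = ToySeverity::kNoError;
    return true;
  }

 private:
  ToySeverity bg_error_ = ToySeverity::kNoError;
};
```

An out-of-space condition would map to one of the recoverable levels in this model: writes stop, the user frees space, calls Resume(), and writes proceed again.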
Closes https://github.com/facebook/rocksdb/pull/3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
2018-06-28 12:34:40 -07:00
Yanqin Jin
26d67e357e Support group commits of version edits (#3944)
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944

Differential Revision: D8432536

Pulled By: riversand963

fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
2018-06-28 12:34:39 -07:00
Maysam Yabandeh
0a5b5d88b2 Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070)
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070

Differential Revision: D8677858

Pulled By: maysamyabandeh

fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
2018-06-28 08:42:17 -07:00
Zhongyi Xie
14f409c0f1 PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067)
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067

Differential Revision: D8669002

Pulled By: miasantreble

fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
2018-06-27 20:42:43 -07:00
Zhichao Cao
1f6efabe23 Add bottommost_compression_opts for bottommost_compression (#3985)
Summary:
For `CompressionType` we have the options `compression` and `bottommost_compression`. Thus, to make the compression options consistent with the compression type when bottommost_compression is enabled, we add bottommost_compression_opts.
Closes https://github.com/facebook/rocksdb/pull/3985

Reviewed By: riversand963

Differential Revision: D8385911

Pulled By: zhichao-cao

fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
2018-06-27 17:42:38 -07:00
Maysam Yabandeh
235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pinned mmap files that could be deleted by background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch re-enables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Manuel Ung
a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

These are the places where reads had to be modified.
- Get and Seek/Next were updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key is visible.
- Prev did not need updates since it did not use the seek-to-sequence-number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
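
The modified Get/Seek behaviour can be sketched on a single key's version list. `ToyVersion` and `ReadLatestVisible` are hypothetical, and visibility is simplified to "own unprepared write, or sequence number at or below the snapshot"; the real visibility check is more involved.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One version of a key. own_unprepared marks data written, but not yet
// committed, by the reading transaction itself.
struct ToyVersion {
  std::uint64_t seq;
  bool own_unprepared;
  std::string value;
};

// Versions are ordered newest first. max_unprepared_seq is zero when the
// transaction has no uncommitted data in the DB.
const ToyVersion* ReadLatestVisible(const std::vector<ToyVersion>& newest_first,
                                    std::uint64_t snapshot_seq,
                                    std::uint64_t max_unprepared_seq) {
  // Seek to max(snapshot_seq, max_unprepared_seq) instead of snapshot_seq.
  const std::uint64_t start_seq = std::max(snapshot_seq, max_unprepared_seq);
  for (const ToyVersion& v : newest_first) {
    if (v.seq > start_seq) continue;  // skipped by the seek
    // Iterate until a visible version is found.
    if (v.own_unprepared || v.seq <= snapshot_seq) return &v;
  }
  return nullptr;
}
```

Seeking straight to the snapshot sequence number would skip the transaction's own unprepared write at a higher sequence number, which is why the starting point is raised.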
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Nikhil Benesch
17339dc2f3 Add table property tracking number of range deletions (#4016)
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
2018-06-26 20:27:35 -07:00
Zhongyi Xie
408205a36b use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899)
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking whether all keys in the range [user_key, iterate_upper_bound) share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
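
A conservative sketch of such a range check for a fixed-length prefix of `n` bytes and bytewise key ordering. The function name is made up, and the test here is only a sufficient condition (the seek key and the upper bound share the same n-byte prefix); the actual check in the PR may accept more cases.

```cpp
#include <cstddef>
#include <string>

// Returns true if every key in [user_key, upper_bound) is guaranteed to
// share the first n bytes. Under bytewise ordering, any key between two
// keys that start with the same n bytes also starts with those bytes.
bool RangeSharesFixedPrefix(const std::string& user_key,
                            const std::string& upper_bound, std::size_t n) {
  // Keys shorter than n bytes are outside the prefix extractor's domain.
  if (user_key.size() < n || upper_bound.size() < n) return false;
  return user_key.compare(0, n, upper_bound, 0, n) == 0;
}
```

For n = 3, the range ["abc1", "abc9") passes, while ["abc1", "abd") is rejected by this sketch even though all of its keys start with "abc", which shows how conservative the condition is.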
Closes https://github.com/facebook/rocksdb/pull/3899

Differential Revision: D8157459

Pulled By: miasantreble

fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
2018-06-26 15:57:26 -07:00
Andrew Kryczka
a8e503e545 Fix universal compaction scheduling conflict with CompactFiles (#4055)
Summary:
Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked.

We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure.

```
db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed.
Aborted (core dumped)
```
Closes https://github.com/facebook/rocksdb/pull/4055

Differential Revision: D8630094

Pulled By: ajkr

fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8
2018-06-26 10:44:56 -07:00
Sagar Vemuri
189f0c27aa Make BlockBasedTableIterator compaction-aware (#4048)
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.

In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048

Differential Revision: D8601056

Pulled By: sagar0

fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
2018-06-25 13:19:27 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
The top-level index in partitioned index/filter blocks is small and could be pinned in memory. So far we achieve that by setting cache_index_and_filter_blocks to false. This, however, makes it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which, in combination with cache_index_and_filter_blocks=true, keeps the top-level index in cache and yet pins it to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00
Zhongyi Xie
795e663df0 option for timing measurement of non-blocking ops during compaction (#4029)
Summary:
For example, calls to CompactionFilter are always timed and the user has no way to disable this.
This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers`
Closes https://github.com/facebook/rocksdb/pull/4029

Differential Revision: D8583670

Pulled By: miasantreble

fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f
2018-06-21 21:28:05 -07:00
Yanqin Jin
524c6e6b72 Add file name info to SequentialFileReader. (#4026)
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026

Differential Revision: D8555214

Pulled By: riversand963

fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
2018-06-21 08:42:24 -07:00
Siying Dong
92ee3350e0 BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004)
Summary:
b555ed30a4 makes the BlockBasedTableIterator invalid if the current position is over the upper bound. However, this can cause a performance regression in the case of multiple Seek()s hitting the same data block but all out of the upper bound.

For example, if an SST file has a data block containing following keys : {a, z}

The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")

Before the upper bound optimization, these queries always came to the same current data block of the iterator, but now inside each Seek() the data block is read from the block cache and then returned to it again.

To prevent this regression, we keep the current data block iterator even if it is out of the upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004

Differential Revision: D8463192

Pulled By: siying

fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
2018-06-19 09:57:11 -07:00
Tomas Kolda
c766887458 Fix ExternalSSTFileTest::OverlappingRanges test on Solaris Sparc (#4012)
Summary:
Fix of #4011
Closes https://github.com/facebook/rocksdb/pull/4012

Differential Revision: D8499173

Pulled By: sagar0

fbshipit-source-id: cbb2b90c544ed364a3640ea65835d577b2dbc5df
2018-06-18 14:57:37 -07:00
Zhongyi Xie
80bc35927c Should only decode restart points for uncompressed blocks (#3996)
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR addresses this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used, and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expect blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996

Differential Revision: D8416186

Pulled By: miasantreble

fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
2018-06-15 19:26:58 -07:00
Anand Ananthabhotla
c48764ba47 Don't generate a notification for a 0 size SST (#4003)
Summary:
Don't call the OnTableFileCreated listener callback when a 0 size SST
file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification.
Closes https://github.com/facebook/rocksdb/pull/4003

Differential Revision: D8461385

Pulled By: anand1976

fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d
2018-06-15 17:57:24 -07:00
zhichao-cao
3fbc865cd5 Add kOptionsStatistics to GetProperty() (#3966)
Summary:
Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass.
Closes https://github.com/facebook/rocksdb/pull/3966

Differential Revision: D8311139

Pulled By: zhichao-cao

fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac
2018-06-15 17:28:01 -07:00
奏之章
f23fed19a1 Delay verify compaction output table (#3979)
Summary:
Verifying a table loads the SST into `TableCache`.
This occupies memory and `TableCache` capacity,
but nothing uses the loaded tables at that point,
so the early load is unnecessary.

Instead, we verify the tables after all subcompactions have finished.
Closes https://github.com/facebook/rocksdb/pull/3979

Differential Revision: D8389946

Pulled By: ajkr

fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49
2018-06-15 12:42:53 -07:00
Fenggang Wu
fbe3b9e2b6 Update db_universal_compaction_test according to PR #3970 (#3995)
Summary:
The SST file sizes changed slightly after the improvement of PR #3970
which reduces the size of the properties block. Before PR #3970 a size
ratio compaction included all of the first four flushed files but it
only includes two files after. We increase the size_ratio universal
compaction option to make that compaction include all four files again.
Closes https://github.com/facebook/rocksdb/pull/3995

Differential Revision: D8426925

Pulled By: fgwu

fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62
2018-06-15 10:42:21 -07:00
Siying Dong
d82f1421b4 Fix regression bug of Prev() with upper bound (#3989)
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this breaks the logic of the following sequence:
  Seek(key);
  if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. It also implicitly happens during the reseeking optimization in Prev().

Fix this bug by using SeekForPrev() rather than this sequence, as is already done in the prefix extractor case.
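
To illustrate why the old sequence goes wrong, here is a toy model (a sketch only; `BoundedKeys` and `OldReposition` are hypothetical names, not RocksDB classes). Positions are indexes into a sorted key vector, with -1 meaning "not valid".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy child iterator over sorted keys with an exclusive upper bound.
struct BoundedKeys {
  std::vector<std::string> keys;  // sorted ascending
  std::string upper_bound;        // exclusive

  // Index of the first key >= target, or -1 at the end or out of the bound.
  int Seek(const std::string& target) const {
    auto it = std::lower_bound(keys.begin(), keys.end(), target);
    if (it == keys.end() || *it >= upper_bound) return -1;
    return static_cast<int>(it - keys.begin());
  }
  // Index of the last key in the underlying data (-1 if there are no keys).
  int SeekToLast() const { return static_cast<int>(keys.size()) - 1; }
  // Index of the last key <= target, or -1 if there is none.
  int SeekForPrev(const std::string& target) const {
    auto it = std::upper_bound(keys.begin(), keys.end(), target);
    return static_cast<int>(it - keys.begin()) - 1;
  }
};

// Old pattern: treats an invalid Seek() as "ran off the end of the data".
int OldReposition(const BoundedKeys& it, const std::string& target) {
  int pos = it.Seek(target);
  return pos >= 0 ? pos : it.SeekToLast();
}
```

With keys {a, c, x, z}, upper bound "m" and target "d", the old pattern lands on "z", while `SeekForPrev("d")` lands on "c".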
Closes https://github.com/facebook/rocksdb/pull/3989

Differential Revision: D8385422

Pulled By: siying

fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
2018-06-12 16:57:36 -07:00
Maysam Yabandeh
b73652169e Extend format 3 to partitioned index/filters (#3958)
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8 bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958

Differential Revision: D8294615

Pulled By: maysamyabandeh

fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
2018-06-06 16:58:16 -07:00
Andrew Kryczka
4420df4b0e Check conflict at output level in CompactFiles (#3926)
Summary:
CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level.
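
A rough sketch of the added check, assuming hypothetical `KeyRange` and `RunningCompaction` types: a new compaction conflicts if a running compaction writes to the same output level over an overlapping key range.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Inclusive user-key range; a hypothetical stand-in for real file metadata.
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// A compaction that is currently running and writing to `output_level`.
struct RunningCompaction {
  int output_level;
  KeyRange range;
};

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// True if a running compaction may create files in the same key range at the
// same output level as the compaction we are about to start.
bool ConflictsAtOutputLevel(int output_level, const KeyRange& range,
                            const std::vector<RunningCompaction>& running) {
  for (const RunningCompaction& c : running) {
    if (c.output_level == output_level && Overlaps(c.range, range)) {
      return true;
    }
  }
  return false;
}
```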
Closes https://github.com/facebook/rocksdb/pull/3926

Differential Revision: D8218996

Pulled By: ajkr

fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb
2018-06-05 14:14:05 -07:00
Zhongyi Xie
f1592a06c2 run make format for PR 3838 (#3954)
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that trigger lint warnings.
Run `make format` to fix formatting as suggested by siying.
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954

Differential Revision: D8272041

Pulled By: miasantreble

fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
2018-06-05 12:58:02 -07:00
Maysam Yabandeh
d0c38c0c8c Extend some tests to format_version=3 (#3942)
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942

Differential Revision: D8238413

Pulled By: maysamyabandeh

fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
2018-06-04 20:13:00 -07:00
Zhongyi Xie
50d7ac0ea3 Fix test for rocksdb_lite: hide incompatible option kDirectIO
Summary:
Previous commit https://github.com/facebook/rocksdb/pull/3935 unhid a few test options, including kDirectIO. However, it is not supported by RocksDB lite, so this option needs to be hidden from the lite build.
Closes https://github.com/facebook/rocksdb/pull/3943

Differential Revision: D8242757

Pulled By: miasantreble

fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a
2018-06-01 20:42:36 -07:00
Andrew Kryczka
fea2b1dfb2 Copy Get() result when file reads use mmap
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.

For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).

This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
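
The copy-versus-pin decision can be sketched as follows. `ResultSlice` and `SetResult` are hypothetical, simplified stand-ins and not the real `PinnableSlice` API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified result holder. Do not copy it after SetResult():
// `data` may point into `self_space`.
struct ResultSlice {
  const char* data = nullptr;
  std::size_t size = 0;
  std::string self_space;  // backing storage used when the value is copied

  std::string ToString() const { return std::string(data, size); }
};

// If the block owns the bytes, the result may point at them directly (the
// caller is assumed to keep the block pinned). If the bytes live in mmap'd
// file memory, copy them so a later unmap cannot invalidate the result.
void SetResult(ResultSlice* out, const char* src, std::size_t n,
               bool block_owns_memory) {
  if (block_owns_memory) {
    out->data = src;
  } else {
    out->self_space.assign(src, n);
    out->data = out->self_space.data();
  }
  out->size = n;
}
```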

This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881

Differential Revision: D8076288

Pulled By: ajkr

fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
2018-06-01 16:57:58 -07:00
straw
89b37081a1 add c api rocksdb_sstfilewriter_file_size
Summary: Closes https://github.com/facebook/rocksdb/pull/3922

Differential Revision: D8208528

Pulled By: ajkr

fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224
2018-06-01 09:43:59 -07:00
Maysam Yabandeh
44cf84932f Fix the bug of some test scenarios being put after kEnd
Summary:
DBTestBase::OptionConfig lists the scenarios that unit tests can iterate over by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd, which makes them essentially invisible to ChangeOptions() callers. This patch fixes that, except for kUniversalSubcompactions, which is left as a TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935

Differential Revision: D8230748

Pulled By: maysamyabandeh

fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
2018-05-31 19:28:00 -07:00
QingpingWang
2807678b11 c api set bottommost level compaction
Summary: Closes https://github.com/facebook/rocksdb/pull/3928

Differential Revision: D8224962

Pulled By: ajkr

fbshipit-source-id: 3caf463509a935bff46530f27232a85ae7e4e484
2018-05-31 17:30:50 -07:00
Siying Dong
82089d59c3 DBImpl::FindObsoleteFiles() not to call GetChildren() on the same path
Summary:
DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it.
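
One way to avoid the repeated listing is to deduplicate the paths first. A small sketch with a hypothetical helper `UniquePaths` (not the code in this PR):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns each directory once, in first-seen order, so that a directory
// shared by several column families is listed only a single time.
std::vector<std::string> UniquePaths(const std::vector<std::string>& cf_paths) {
  std::set<std::string> seen;
  std::vector<std::string> unique;
  for (const std::string& path : cf_paths) {
    if (seen.insert(path).second) unique.push_back(path);
  }
  return unique;
}
```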
Closes https://github.com/facebook/rocksdb/pull/3885

Differential Revision: D8084634

Pulled By: siying

fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0
2018-05-31 12:58:33 -07:00
maoyouxiang
a35451eaa4 fix deadlock with enable_pipelined_write=true and max_successive_merges > 0
Summary:
fix this https://github.com/facebook/rocksdb/issues/3916
Closes https://github.com/facebook/rocksdb/pull/3923

Differential Revision: D8215192

Pulled By: yiwu-arbug

fbshipit-source-id: a4c2f839a91d92dc70906d2b7c6de0fe014a2422
2018-05-31 11:13:14 -07:00
Siying Dong
4dd80debd0 Remove tests from ROCKSDB_VALGRIND_RUN
Summary:
In order to make the valgrind check pass within a day, remove some tests that run prohibitively slowly under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924

Differential Revision: D8210184

Pulled By: siying

fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
2018-05-30 16:15:16 -07:00
Anand Ananthabhotla
a736255de8 Delete triggered compaction for universal style
Summary:
This is still WIP, but I'm hoping for early feedback on the overall approach.

This patch implements deletion triggered compaction, which till now only
worked for leveled, for universal style. SST files are marked for
compaction by the CompactOnDeletionCollector table property. This is
expected to be used when free disk space is low and the user wants to
reclaim space by deleting a bunch of keys. The deletions are expected to
be dense. In such a situation, we want to avoid a full compaction due to
its space overhead.

The strategy used in this case is similar to leveled. We pick one file
from the set of files marked for compaction. We then expand the inputs
to a clean cut on the same level, and then pick overlapping files from
the next non-empty level. Picking files from the next level can cause
the key range to expand, and we opportunistically expand inputs in the
source level to include files wholly in this key range.

The main side effect of this is that it breaks the property of no time
range overlap between levels. This shouldn't break any functionality.
Closes https://github.com/facebook/rocksdb/pull/3860

Differential Revision: D8124397

Pulled By: anand1976

fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
2018-05-29 15:44:34 -07:00
Yanqin Jin
cf826de3ed Fix compilation error when OPT="-DROCKSDB_LITE".
Summary: Closes https://github.com/facebook/rocksdb/pull/3917

Differential Revision: D8187733

Pulled By: riversand963

fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910
2018-05-29 12:28:59 -07:00
奏之章
1c1bafa668 Fix VersionStorageInfo::EstimateLiveDataSize seg fault
Summary:
`HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true
402b7aa07f/db/internal_stats.cc (L412-L413)
so it will ref a `SuperVersion`
402b7aa07f/db/db_impl.cc (L1896-L1908)
so the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe, but `cfd_->current()` is not safe!
402b7aa07f/db/internal_stats.cc (L790-L795)

`cfd_->current()` may be invalid ...

here's mongo-rocks crash backtrace
```
 mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51]
 mongod(+0x2152E89) [0x7fe3a3136e89]
 mongod(+0x21534F6) [0x7fe3a31374f6]
 libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0]
 mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7]
 mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba]
 mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long*, rocksdb::DBImpl*, rocksdb::Version*)+0x20) [0x7fe3a2250d70]
 mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData*, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf]
```
Closes https://github.com/facebook/rocksdb/pull/3912

Differential Revision: D8179944

Pulled By: yiwu-arbug

fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1
2018-05-28 11:27:08 -07:00
Maysam Yabandeh
402b7aa07f Exclude seq from index keys
Summary:
Index blocks have the same format as data blocks. The keys, like the keys in data blocks, are therefore internal keys, which means that in addition to the user key each also has 8 bytes that encode the sequence number and value type. These extra 8 bytes are however not necessary in index blocks, since the index keys act as a separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index blocks.
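
A sketch of the key layout and of the "special case" test described above. The helper names are hypothetical; the 8-byte footer packing `(sequence << 8) | type` in little-endian order is stated here as an assumption about the internal key encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Internal key: user key followed by an 8-byte footer (assumed layout:
// (sequence << 8) | type, little-endian).
std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                            uint8_t type) {
  uint64_t packed = (seq << 8) | type;
  std::string out = user_key;
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return out;
}

// Dropping the footer yields the user key, saving 8 bytes per index entry.
std::string ExtractUserKey(const std::string& internal_key) {
  assert(internal_key.size() >= 8);
  return internal_key.substr(0, internal_key.size() - 8);
}

// Each pair is (first user key, last user key) of one data block, in order.
// The sequence number can be dropped only if no block's last user key equals
// the next block's first user key.
bool CanDropSeqFromIndex(
    const std::vector<std::pair<std::string, std::string>>& blocks) {
  for (std::size_t i = 0; i + 1 < blocks.size(); ++i) {
    if (blocks[i].second == blocks[i + 1].first) return false;
  }
  return true;
}
```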
Closes https://github.com/facebook/rocksdb/pull/3894

Differential Revision: D8118775

Pulled By: maysamyabandeh

fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
2018-05-25 18:42:43 -07:00
Yanqin Jin
aa53579d6c Fix segfault caused by object premature destruction
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.

To summarize the cause of the problem: upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of the `std::shared_ptr` to run and the underlying memory to be freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
The previous workaround was to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a forced release of table readers of the dropped column family. However, this does not work when the user disables file deletion.

Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
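
The idea can be sketched with simplified, hypothetical types (these are not the real RocksDB classes): the reader keeps its own copy of the options, so it no longer depends on the factory's lifetime.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, simplified options and factory.
struct TableOptions {
  int block_size = 4096;
};

struct TableFactory {
  TableOptions table_options;
};

// The reader copies the options instead of referring to the factory's
// member, trading a little memory for safety after the factory is freed.
struct TableReaderRep {
  explicit TableReaderRep(const TableOptions& opts) : table_options(opts) {}
  const TableOptions table_options;
};
```

Holding a reference instead of a copy would dangle as soon as the last `std::shared_ptr` to the factory is released.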

Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```

Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898

Differential Revision: D8149421

Pulled By: riversand963

fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
2018-05-25 11:57:51 -07:00
QingpingWang
070319f7bb add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup
Summary:
Add flush_before_backup to rocksdb_backup_engine_create_new_backup. This makes the C API able to control the flush-before-backup behavior.
Closes https://github.com/facebook/rocksdb/pull/3897

Differential Revision: D8157676

Pulled By: ajkr

fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505
2018-05-24 22:28:52 -07:00
Yi Wu
bc7e8d472e LRUCache midpoint insertion
Summary:
Implement midpoint insertion strategy where new blocks will be inserted into the middle of the LRU list, then moved to the head on the first hit in cache.
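
A toy sketch of the strategy, not the actual LRUCache code: `MidpointLRU` is a hypothetical class that models the list as a hot half and a cold half. New keys enter at the head of the cold half (the midpoint), and a hit promotes a key to the head of the hot half. Unlike a real cache, this sketch does not cap the size of the hot half.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class MidpointLRU {
 public:
  explicit MidpointLRU(std::size_t capacity) : capacity_(capacity) {}

  // New keys are inserted at the midpoint, i.e. the head of the cold list.
  void Insert(const std::string& key) {
    if (capacity_ == 0 || index_.count(key) > 0) return;
    while (index_.size() >= capacity_) EvictOne();
    cold_.push_front(key);
    index_[key] = Entry{false, cold_.begin()};
  }

  // A hit moves the key to the head of the hot list.
  bool Lookup(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    Entry& e = it->second;
    if (e.hot) {
      hot_.erase(e.pos);
    } else {
      cold_.erase(e.pos);
    }
    hot_.push_front(it->first);
    e = Entry{true, hot_.begin()};
    return true;
  }

  bool Contains(const std::string& key) const { return index_.count(key) > 0; }

 private:
  struct Entry {
    bool hot;
    std::list<std::string>::iterator pos;
  };

  // Evict from the tail of the cold list first, then from the hot list.
  void EvictOne() {
    std::list<std::string>& victims = cold_.empty() ? hot_ : cold_;
    index_.erase(victims.back());
    victims.pop_back();
  }

  std::size_t capacity_;
  std::list<std::string> hot_;
  std::list<std::string> cold_;
  std::unordered_map<std::string, Entry> index_;
};
```

The effect is that a one-off scan only churns the cold half, while keys that were hit at least once survive.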
Closes https://github.com/facebook/rocksdb/pull/3877

Differential Revision: D8100895

Pulled By: yiwu-arbug

fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
2018-05-24 15:57:33 -07:00
Yanqin Jin
4011012d9d Specify the underlying type of enums.
Summary:
Explicitly specifying the underlying type of enums helps developers understand the physical storage.
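
For illustration, with a hypothetical enum (not one from the RocksDB headers), the storage size becomes explicit and checkable at compile time:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical enum; the explicit `: uint8_t` fixes its storage to one byte.
enum class EntryKind : uint8_t { kDeletion = 0, kValue = 1, kMerge = 2 };

static_assert(sizeof(EntryKind) == 1, "EntryKind is stored in a single byte");
static_assert(
    std::is_same<std::underlying_type<EntryKind>::type, uint8_t>::value,
    "underlying type is uint8_t");

// Converting to the underlying integer is explicit and lossless.
uint8_t ToByte(EntryKind kind) { return static_cast<uint8_t>(kind); }
```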
Closes https://github.com/facebook/rocksdb/pull/3892

Differential Revision: D8107027

Pulled By: riversand963

fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3
2018-05-23 16:12:59 -07:00
Andrew Kryczka
7db721b9a6 Avoid sleep in DBTest.GroupCommitTest to fix flakiness
Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883

Differential Revision: D8079429

Pulled By: ajkr

fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049
2018-05-22 12:16:25 -07:00
Siying Dong
3db1ada3bf PersistRocksDBOptions() to use WritableFileWriter
Summary:
By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls into one write() file system call. Without this, each Append() can be expensive for an underlying Env that has no write buffering of its own.
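
The buffering effect can be sketched with two hypothetical classes (`CountingFile` and `BufferedWriter` are illustrations, not the WritableFile or WritableFileWriter interfaces):

```cpp
#include <cassert>
#include <string>

// Stand-in for an unbuffered file: every Write() models one system call.
class CountingFile {
 public:
  void Write(const std::string& data) {
    ++write_calls;
    contents += data;
  }
  int write_calls = 0;
  std::string contents;
};

// Collects Append() calls in memory and issues a single Write() on Flush().
class BufferedWriter {
 public:
  explicit BufferedWriter(CountingFile* file) : file_(file) {}

  void Append(const std::string& data) { buffer_ += data; }

  void Flush() {
    if (buffer_.empty()) return;
    file_->Write(buffer_);
    buffer_.clear();
  }

 private:
  CountingFile* file_;
  std::string buffer_;
};
```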
Closes https://github.com/facebook/rocksdb/pull/3882

Differential Revision: D8080673

Pulled By: siying

fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b
2018-05-21 16:42:22 -07:00
Zhongyi Xie
c3ebc75843 Move prefix_extractor to MutableCFOptions
Summary:
Currently it is not possible to change bloom filter config without restarting the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
2018-05-21 14:43:11 -07:00
Yanqin Jin
263ef52b65 Update ColumnFamilyTest for multi-CF verification
Summary:
Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
family's keys are stored in one set.

ajkr When you have a chance, can you PTAL? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3871

Differential Revision: D8056447

Pulled By: riversand963

fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386
2018-05-21 11:57:42 -07:00
Andrew Kryczka
7b655214d2 Assert keys/values pinned by range deletion meta-block iterators
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875

Differential Revision: D8063667

Pulled By: ajkr

fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
2018-05-21 09:57:00 -07:00
Zhongyi Xie
ed4d3393fb fix a division by zero bug
Summary:
fixes the failing clang_analyze contrun test
Closes https://github.com/facebook/rocksdb/pull/3872

Differential Revision: D8059241

Pulled By: miasantreble

fbshipit-source-id: e8fc1838004fe16a823456188386b8b39429803b
2018-05-18 21:57:24 -07:00
Siying Dong
17af09fcce Implement key shortening functions in ReverseBytewiseComparator
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten the key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too, as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the first key is larger than the second.

Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836

Differential Revision: D7959762

Pulled By: siying

fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
2018-05-17 18:27:16 -07:00
Zhongyi Xie
1d7ca20f29 add override to virtual functions
Summary:
this will fix the failing clang_check test
Closes https://github.com/facebook/rocksdb/pull/3868

Differential Revision: D8050880

Pulled By: miasantreble

fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6
2018-05-17 17:57:48 -07:00
Mike Kolupaev
8bf555f487 Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
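
The new convention can be sketched with a toy iterator. `ToyIterator` and `FailLastOperation()` are hypothetical and only model the two rules above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class ToyIterator {
 public:
  // `keys` must be sorted ascending. The iterator starts unpositioned.
  explicit ToyIterator(std::vector<int> keys)
      : keys_(std::move(keys)), pos_(keys_.size()), ok_(true) {}

  // Any seek resets the status before repositioning.
  void Seek(int target) {
    ok_ = true;
    pos_ = static_cast<std::size_t>(
        std::lower_bound(keys_.begin(), keys_.end(), target) - keys_.begin());
  }

  // Test hook that simulates an error in the last operation.
  void FailLastOperation() { ok_ = false; }

  // A non-ok status always implies an invalid iterator.
  bool Valid() const { return ok_ && pos_ < keys_.size(); }
  bool status_ok() const { return ok_; }

  int key() const {
    assert(Valid());
    return keys_[pos_];
  }

 private:
  std::vector<int> keys_;
  std::size_t pos_;
  bool ok_;
};
```

Callers can then loop on `Valid()` alone and check the status once after the loop ends.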

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name in anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
2018-05-17 02:56:56 -07:00
Maysam Yabandeh
46fde6b653 Fix race condition between log_.erase and log_.back
Summary:
The logs_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. logs_.erase however does that while holding only mutex_. This causes a race condition with two_write_queues, since logs_.back is read while holding only log_write_mutex_ (which is correct according to the logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852, although I could not reproduce it.
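
A simplified model of the contract, with a hypothetical `LogList` class: writers take both mutexes in a fixed order, so a reader holding either one is safe.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical model of a log list guarded by two mutexes. Modifications
// hold both mutexes, always acquired in the same order; reads may hold
// only one of them.
class LogList {
 public:
  void Append(int log_number) {
    std::lock_guard<std::mutex> a(mutex_);
    std::lock_guard<std::mutex> b(log_write_mutex_);
    logs_.push_back(log_number);
  }

  // Erasing takes both locks so it cannot run concurrently with Back().
  void EraseFront() {
    std::lock_guard<std::mutex> a(mutex_);
    std::lock_guard<std::mutex> b(log_write_mutex_);
    if (!logs_.empty()) logs_.pop_front();
  }

  // Reading under log_write_mutex_ alone is allowed by the contract.
  bool Back(int* out) {
    std::lock_guard<std::mutex> b(log_write_mutex_);
    if (logs_.empty()) return false;
    *out = logs_.back();
    return true;
  }

 private:
  std::mutex mutex_;
  std::mutex log_write_mutex_;
  std::deque<int> logs_;
};
```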
Fixes https://github.com/facebook/rocksdb/issues/3852
Closes https://github.com/facebook/rocksdb/pull/3859

Differential Revision: D8026103

Pulled By: maysamyabandeh

fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28
2018-05-16 13:01:33 -07:00
Maysam Yabandeh
12ad711247 Suppress tsan lock-order-inversion on FlushWAL
Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.

The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854

Differential Revision: D8000202

Pulled By: maysamyabandeh

fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
2018-05-14 21:13:35 -07:00
Andrew Kryczka
3d7dc75b36 Bottommost level-based compactions in bottom-pri pool
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use a more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835

Differential Revision: D7957179

Pulled By: ajkr

fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
2018-05-14 14:57:15 -07:00
Maysam Yabandeh
718c1c9c1f Pass manual_wal_flush also to the first wal file
Summary:
Currently manual_wal_flush, if set in the options, will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also updates the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824

Differential Revision: D7909153

Pulled By: maysamyabandeh

fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
2018-05-14 10:57:56 -07:00
Sergey Elin
3272bc07c6 Fix formatting in log message
Summary:
Add missing space.
Closes https://github.com/facebook/rocksdb/pull/3826

Differential Revision: D7956059

Pulled By: miasantreble

fbshipit-source-id: 3aeba76385f8726399a3086c46de710636a31191
2018-05-11 11:28:54 -07:00
Andrew Kryczka
072ae671a7 Apply use_direct_io_for_flush_and_compaction to writes only
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
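
A minimal sketch of the resulting decision table, assuming a hypothetical `IoFlags` struct and `UsesDirectIO` helper that stand in for the two `DBOptions` fields (neither name comes from the patch):

```cpp
// Hypothetical stand-in for the two DBOptions flags discussed above.
struct IoFlags {
  bool use_direct_reads = false;
  bool use_direct_io_for_flush_and_compaction = false;
};

enum class IoKind { kForegroundRead, kBackgroundRead, kBackgroundWrite };

// Returns true if the given kind of I/O would bypass the OS page cache
// under the semantics described in this commit message.
inline bool UsesDirectIO(const IoFlags& flags, IoKind kind) {
  switch (kind) {
    case IoKind::kForegroundRead:
    case IoKind::kBackgroundRead:
      // All reads follow use_direct_reads, so a file is never read in
      // two different modes at once.
      return flags.use_direct_reads;
    case IoKind::kBackgroundWrite:
      // Only flush/compaction writes follow the repurposed flag.
      return flags.use_direct_io_for_flush_and_compaction;
  }
  return false;  // not reached; keeps compilers quiet
}
```

With only `use_direct_io_for_flush_and_compaction` set, both read kinds stay buffered and only background writes go direct.
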
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
2018-05-09 19:42:58 -07:00
Dmitri Smirnov
f92cd2feb4 Introduce and use the option to disable stall notifications structures
Summary:
and code. Removing this helps with insert performance.
Closes https://github.com/facebook/rocksdb/pull/3830

Differential Revision: D7921030

Pulled By: siying

fbshipit-source-id: 84e80d50a7ef96f5441c51c9a0d089c50217cce2
2018-05-09 10:13:53 -07:00
Andrew Kryczka
4bf169f07e Disable readahead when using mmap for reads
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the  `mmap`'d memory region. This PR:

- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
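
The scratch-buffer assumption can be illustrated with a standalone sketch. `SliceView`, `CopyingFile`, `MappedFile`, and `ReadFilledScratch` are hypothetical stand-ins, not the RocksDB classes named above:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Minimal stand-in for a Slice: a non-owning (pointer, length) view.
struct SliceView {
  const char* data;
  size_t size;
};

// Hypothetical buffered file: copies the requested bytes into `scratch`.
struct CopyingFile {
  std::string contents;
  SliceView Read(size_t offset, size_t n, char* scratch) const {
    if (offset > contents.size()) offset = contents.size();
    if (n > contents.size() - offset) n = contents.size() - offset;
    std::memcpy(scratch, contents.data() + offset, n);
    return SliceView{scratch, n};
  }
};

// Hypothetical mmap-style file: ignores `scratch` and returns a view
// pointing directly into its own memory.
struct MappedFile {
  std::string mapped;
  SliceView Read(size_t offset, size_t n, char* /*scratch*/) const {
    if (offset > mapped.size()) offset = mapped.size();
    if (n > mapped.size() - offset) n = mapped.size() - offset;
    return SliceView{mapped.data() + offset, n};
  }
};

// The condition a readahead wrapper would need to hold: a non-empty
// result must live in the scratch buffer the wrapper supplied.
template <typename File>
bool ReadFilledScratch(const File& file, size_t offset, size_t n,
                       char* scratch) {
  SliceView result = file.Read(offset, n, scratch);
  return result.size == 0 || result.data == scratch;
}
```

The check holds for the copying file and fails for the mapped one, which is why a wrapper that caches its scratch buffer should not wrap the latter.
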
Closes https://github.com/facebook/rocksdb/pull/3813

Differential Revision: D7891513

Pulled By: ajkr

fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
2018-05-08 12:13:18 -07:00
Maysam Yabandeh
d72a51e9e1 Split FaultInjectionTest.FaultTest to avoid timeout
Summary:
The tsan flavor of this test occasionally times out in our test infra. The patch splits the test into two, each working on half of the option range.
Before:
[       OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms)
After:
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms)
Closes https://github.com/facebook/rocksdb/pull/3819

Differential Revision: D7894975

Pulled By: maysamyabandeh

fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9
2018-05-07 12:29:58 -07:00
LingBin
72942ad7a4 Recommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple times"
Summary:
The original commit #3635 hurt performance for users who aren't using range deletions, because of unneeded std::set operations, so it was reverted by commit 44653c7b7a (see #3672).

To fix this, move the set to  and add a check in , i.e., file will be added only if  is non-nullptr.

The db_bench command that exposed the performance regression:
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

I re-ran this command on the machine before and after the modification; the results are as follows:

**fillrandom**

| Table | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
| after commit | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |

**seekrandomwhilewriting**

| Table | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
| after commit | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |

Closes https://github.com/facebook/rocksdb/pull/3800

Differential Revision: D7874245

Pulled By: ajkr

fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
2018-05-04 16:45:15 -07:00
Maysam Yabandeh
171f415b30 Rename vars to satisfy unity build
Summary:
Tested by "make unity_test"
Closes https://github.com/facebook/rocksdb/pull/3807

Differential Revision: D7882657

Pulled By: maysamyabandeh

fbshipit-source-id: 84862c18d7f2fc762bd96ad070eaeb6936e45159
2018-05-04 15:28:06 -07:00
Zhongyi Xie
a703432808 MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction
Summary:
`MutableCFOptions::RefreshDerivedOptions` always assumes the base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and level-based compaction is used.
This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`)
Fixes https://github.com/facebook/rocksdb/issues/3229
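
A rough sketch of what recomputing at query time could look like. This is an assumption about the shape of the fix, not a copy of the patch: `BuildMaxFileSizeTable` and `MaxFileSizeForLevelSketch` are hypothetical names, and the rule of indexing the table relative to the base level is inferred from the description above.

```cpp
#include <cstdint>
#include <vector>

// Static table as it would be derived when the base level is assumed to
// be L1: levels 0 and 1 get the base size, each deeper level multiplies.
inline std::vector<uint64_t> BuildMaxFileSizeTable(
    int num_levels, uint64_t target_file_size_base,
    uint64_t target_file_size_multiplier) {
  std::vector<uint64_t> table(num_levels);
  for (int i = 0; i < num_levels; ++i) {
    table[i] = (i > 1) ? table[i - 1] * target_file_size_multiplier
                       : target_file_size_base;
  }
  return table;
}

// Query-time lookup: with dynamic level bytes, index the table relative
// to the actual base level instead of the absolute level number.
inline uint64_t MaxFileSizeForLevelSketch(const std::vector<uint64_t>& table,
                                          int level, int base_level,
                                          bool dynamic_level_bytes) {
  if (!dynamic_level_bytes || level < base_level) {
    return table[level];
  }
  return table[level - base_level];
}
```

With `target_file_size_base=2097152`, `target_file_size_multiplier=2`, seven levels, and a base level of 5, the static lookup gives L6 a 64 MB limit, while the relative lookup gives it the 2 MB base size.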

In master:

```
Level Files Size(MB)
--------------------
  0       14      846
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5       15      366
  6       11      481
Cumulative compaction: 3.83 GB write, 2.27 GB read
```
In branch:
```
Level Files Size(MB)
--------------------
  0        9      544
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6      445      935
Cumulative compaction: 2.91 GB write, 1.46 GB read
```

db_bench command used:
```
./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2
```
Closes https://github.com/facebook/rocksdb/pull/3755

Differential Revision: D7721381

Pulled By: miasantreble

fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8
2018-05-03 16:42:13 -07:00
Dmitri Smirnov
934f96de27 Better destroydb
Summary:
Delete archive directory before WAL folder
  since archive may be contained as a subfolder.
  Also improve loop readability.
Closes https://github.com/facebook/rocksdb/pull/3797

Differential Revision: D7866378

Pulled By: riversand963

fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5
2018-05-03 16:13:09 -07:00
Maysam Yabandeh
a8d77ca381 Speedup ManualCompactionTest.Test
Summary:
ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
Closes https://github.com/facebook/rocksdb/pull/3802

Differential Revision: D7865596

Pulled By: maysamyabandeh

fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785
2018-05-03 16:13:09 -07:00
Siying Dong
d59549298f Skip deleted WALs during recovery
Summary:
This patch records the min log number to keep in the manifest while flushing SST files, so that any WAL older than that number is ignored during recovery. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could be caused by, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.

Before this commit, in the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort is in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after a flush or compaction.)

This min log number to keep is stored in the manifest using the safely-ignorable customized field of the AddFile entry, in order to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
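
The recovery-side effect can be sketched as a simple filter. `WalsToReplay` is a hypothetical helper used only for illustration; the real recovery code is not shown in this commit message.

```cpp
#include <cstdint>
#include <vector>

// Given the WAL numbers found on disk and the min log number recorded
// in the manifest, return the WALs that recovery should replay.
// Anything older than the recorded number is skipped.
inline std::vector<uint64_t> WalsToReplay(
    const std::vector<uint64_t>& wal_numbers_on_disk,
    uint64_t min_log_number_to_keep) {
  std::vector<uint64_t> result;
  for (uint64_t number : wal_numbers_on_disk) {
    if (number >= min_log_number_to_keep) {
      result.push_back(number);
    }
  }
  return result;
}
```

For example, if WALs 3, 7 and 8 are on disk (an intermediate WAL having been deleted out of order) and the manifest records 7, only 7 and 8 are replayed, so the stale WAL 3 on the far side of the gap is never fed to recovery.
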
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Zhongyi Xie
6cab3184f5 avoid double delete on dummy record insertion failure
Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688

Differential Revision: D7537741

Pulled By: miasantreble

fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6
2018-05-01 16:01:28 -07:00
Victor Grishchenko
c9ace1d81b expose WAL iterator in the C API
Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304

Differential Revision: D6628736

Pulled By: miasantreble

fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a
2018-04-27 16:56:59 -07:00
Huachao Huang
ed7a95b28c Add max_subcompactions as a compaction option
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
This adds a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically.
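
One way to picture the override is shown below. `EffectiveMaxSubcompactions` is a hypothetical helper, and treating a per-call value of 0 as "use the `DBOptions` default" is an assumption, not something stated in this commit message.

```cpp
#include <cstdint>

// A per-compaction value greater than zero wins; otherwise fall back to
// the DB-wide default (assumed convention for illustration).
inline uint32_t EffectiveMaxSubcompactions(uint32_t db_default,
                                           uint32_t per_compaction_override) {
  return per_compaction_override > 0 ? per_compaction_override : db_default;
}
```
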
Closes https://github.com/facebook/rocksdb/pull/3775

Differential Revision: D7792357

Pulled By: ajkr

fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
2018-04-27 11:57:39 -07:00
Yanqin Jin
7dfbe33532 Rename pending_compaction_ to queued_for_compaction_.
Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781

Differential Revision: D7790063

Pulled By: riversand963

fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
2018-04-27 11:12:01 -07:00
Yanqin Jin
513b5ce618 Rename pending_flush_ to queued_for_flush_.
Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::SchedulePendingFlush

```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```

`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777

Differential Revision: D7783066

Pulled By: riversand963

fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
2018-04-26 21:12:51 -07:00
Siying Dong
63c965cdb4 Sync parent directory after deleting a file in delete scheduler
Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767

Differential Revision: D7760136

Pulled By: siying

fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
2018-04-26 13:58:20 -07:00
Vincent Lee
7c9f23e6db Rate limiter should be allowed to share between different rocksdb instances in C API
Summary:
Currently, `rocksdb_options_set_ratelimiter` in `c.cc` sets the input to nil, which makes it impossible to use a shared rate limiter created by `rocksdb_ratelimiter_create` in different rocksdb options.

In this PR, I changed it to a shared ptr.
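
A minimal sketch of the sharing behavior, using hypothetical `LimiterHandle` and `OptionsHandle` structs rather than the actual C API wrapper types:

```cpp
#include <memory>

// Stand-in for the rate limiter object itself.
struct RateLimiterStub {
  long bytes_per_second;
};

// Hypothetical wrapper structs, loosely modeled on how a C API can hold
// a C++ object behind an opaque handle.
struct LimiterHandle {
  std::shared_ptr<RateLimiterStub> rep;
};
struct OptionsHandle {
  std::shared_ptr<RateLimiterStub> rate_limiter;
};

// Copying the shared_ptr (instead of taking ownership and clearing the
// handle) leaves the handle usable for further options objects.
inline void SetRateLimiter(OptionsHandle* opt, const LimiterHandle* limiter) {
  opt->rate_limiter = limiter->rep;
}
```

After two `SetRateLimiter` calls with the same handle, both options objects point at the same limiter and the handle itself is still valid.
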
Closes https://github.com/facebook/rocksdb/pull/3758

Differential Revision: D7749740

Pulled By: ajkr

fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5
2018-04-25 15:57:48 -07:00
Mike Kolupaev
affe01b0d5 Improve write time breakdown stats
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
2018-04-23 17:58:54 -07:00
Siying Dong
d5afa73789 Revert "Skip deleted WALs during recovery"
Summary:
This reverts commit 73f21a7b21.

It breaks compatibility. When a DB is created using a build with this new change, opening the DB with an older build and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
2018-04-23 12:01:26 -07:00
Anand Ananthabhotla
dbdaa4662e Add a stat for MultiGet keys found, update memtable hit/miss stats
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730

Differential Revision: D7677364

Pulled By: anand1976

fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
2018-04-20 15:28:19 -07:00
Maysam Yabandeh
c3d1e36cce WritePrepared Txn: enable TryAgain for duplicates at the end of the batch
Summary:
WriteBatch::Iterate will retry with a larger sequence number if the memtable reports a duplicate. This is signaled with the TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when the WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However, when reading a WAL generated by WriteCommitted, this batch separator might not exist. Although WritePrepared is not supposed to be able to read a WAL generated by WriteCommitted, we should avoid confusing scenarios in which the behavior becomes unpredictable. The patch fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747

Differential Revision: D7708391

Pulled By: maysamyabandeh

fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
2018-04-20 15:28:19 -07:00
przemyslaw.skibinski@percona.com
dee95a1afc Fix GitHub issue #3716: gcc-8 warnings
Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]

Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736

Differential Revision: D7684161

Pulled By: yiwu-arbug

fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
2018-04-20 13:42:47 -07:00
Zhongyi Xie
e1e826b980 check return status for Sync() and Append() calls to avoid corruption
Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error statuses are not checked. This could lead to potential corruption as later calls will overwrite the error status.
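
The pattern being fixed can be sketched in isolation. `StatusStub` and `RunAll` are hypothetical names; the real functions operate on RocksDB's own status type and file writers.

```cpp
#include <functional>
#include <vector>

// Stand-in for a status object: only records success or failure.
struct StatusStub {
  bool ok;
};

// Runs each step in order and stops at the first failure, so a later
// successful call cannot overwrite an earlier error.
inline StatusStub RunAll(
    const std::vector<std::function<StatusStub()>>& steps) {
  StatusStub s{true};
  for (const auto& step : steps) {
    s = step();
    if (!s.ok) {
      break;
    }
  }
  return s;
}
```

Without the `break`, a failing second step followed by a successful third step would leave `s` reporting success.
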
Closes https://github.com/facebook/rocksdb/pull/3740

Differential Revision: D7678848

Pulled By: miasantreble

fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77
2018-04-19 14:13:46 -07:00
Yi Wu
ad511684b2 Add block cache related DB properties
Summary:
Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
Closes https://github.com/facebook/rocksdb/pull/3734

Differential Revision: D7657180

Pulled By: yiwu-arbug

fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd
2018-04-18 21:42:25 -07:00
Yanqin Jin
5e48811844 Initialize a boolean member variable of a struct.
Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).
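
A minimal sketch of the kind of change involved, using a hypothetical struct and field names (the actual struct lives in `db/external_sst_file_ingestion_job.h` and is not reproduced here):

```cpp
// Hypothetical struct for illustration only.
struct IngestedFileInfoSketch {
  unsigned long long file_size = 0;
  // Default member initializer: every construction path yields a valid
  // bool, so copying the struct never loads an indeterminate value.
  bool copy_file = false;
};
```

Without the initializer, copying a default-constructed instance reads an indeterminate `bool`, which is what UBSAN flags as "load of value ..., which is not a valid value for type 'bool'".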

UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN      ] ExternalSSTFileBasicTest.Basic
[       OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN      ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'

miasantreble  I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```

ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728

Differential Revision: D7642476

Pulled By: riversand963

fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e
2018-04-16 14:28:01 -07:00
Zhongyi Xie
954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
zhangjinpeng1987
31ee4bf240 add kEntryRangeDeletion
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect information about range deletions and store it in user properties, which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695

Differential Revision: D7570322

Pulled By: ajkr

fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
2018-04-13 11:27:17 -07:00
Yanqin Jin
c81b0abedd Improve accuracy of I/O stats collection of external SST ingestion.
Summary:
RocksDB supports ingestion of external SSTs. If ingestion_options.move_files is true, RocksDB first tries to hard-link the external SSTs when performing ingestion. If an external SST file resides on a different FS, or the underlying FS does not support hard links, RocksDB performs an actual file copy. However, no matter which choice is made, the current code increases bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy the file.

Rename a sync point.
Closes https://github.com/facebook/rocksdb/pull/3713

Differential Revision: D7604151

Pulled By: riversand963

fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
2018-04-13 10:58:42 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments, which allows us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Yanqin Jin
d42bd041c5 Improve visibility into the reasons for compaction.
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679

Differential Revision: D7550348

Pulled By: riversand963

fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
2018-04-11 10:58:44 -07:00
Andrew Kryczka
019d7894eb fix calling SetOptions on deprecated options
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:

```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```

We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
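
The aliasing problem and the skip can be sketched with a toy registry. `MutableOptsSketch`, `OptionInfo`, `Registry`, and `SetOption` are hypothetical names, and all values are treated as `uint64_t` for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Toy options struct; write_buffer_size is the first field (offset 0).
struct MutableOptsSketch {
  uint64_t write_buffer_size;
  uint64_t max_bytes_for_level_base;
};

struct OptionInfo {
  size_t offset;    // byte offset of the field inside MutableOptsSketch
  bool deprecated;  // deprecated entries carry offset 0, aliasing field 0
};

inline const std::map<std::string, OptionInfo>& Registry() {
  static const std::map<std::string, OptionInfo> kRegistry = {
      {"write_buffer_size",
       {offsetof(MutableOptsSketch, write_buffer_size), false}},
      {"max_bytes_for_level_base",
       {offsetof(MutableOptsSketch, max_bytes_for_level_base), false}},
      {"soft_rate_limit", {0, true}},
      {"hard_rate_limit", {0, true}},
  };
  return kRegistry;
}

// Returns false for unknown options. Deprecated options are accepted
// but skipped, so they never write through their placeholder offset.
inline bool SetOption(MutableOptsSketch* opts, const std::string& name,
                      uint64_t value) {
  auto it = Registry().find(name);
  if (it == Registry().end()) {
    return false;
  }
  if (it->second.deprecated) {
    return true;  // a real implementation would also log a warning here
  }
  char* base = reinterpret_cast<char*>(opts);
  *reinterpret_cast<uint64_t*>(base + it->second.offset) = value;
  return true;
}
```

Removing the `deprecated` check reproduces the bug: setting `soft_rate_limit` to 0 would zero out `write_buffer_size`.
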
Closes https://github.com/facebook/rocksdb/pull/3700

Differential Revision: D7572596

Pulled By: ajkr

fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
2018-04-10 19:02:09 -07:00
Yanqin Jin
d95014b9df fix some text in comments.
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
   our best to conform to conventions. Specifically, use 'callback' instead of
   'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693

Differential Revision: D7560396

Pulled By: riversand963

fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
2018-04-10 15:59:24 -07:00
Zhongyi Xie
2770a94c42 make MockTimeEnv::current_time_ atomic to fix data race
Summary:
fix a new TSAN failure
https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
Closes https://github.com/facebook/rocksdb/pull/3694

Differential Revision: D7565310

Pulled By: miasantreble

fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
2018-04-10 14:13:18 -07:00
Gihwan Oh
65fe8d6cd6 Change a comment
Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686

Differential Revision: D7556781

Pulled By: ajkr

fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
2018-04-09 13:42:31 -07:00
Andrew Kryczka
1c27cbfbd1 fix intra-L0 FIFO for uncompressed use case
Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is to prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
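
The second point can be illustrated with a reconstruction of the span search. This is a sketch inferred from the description above, not the actual `FindIntraL0Compaction`; it works on plain file sizes, ignores files that are already being compacted, and uses the hypothetical name `FindIntraL0SpanSketch`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the length of the eligible span [0, span_len), or 0 if no
// span qualifies.
inline size_t FindIntraL0SpanSketch(const std::vector<uint64_t>& file_sizes,
                                    size_t min_files_to_compact,
                                    uint64_t max_compact_bytes_per_del_file) {
  if (file_sizes.empty()) {
    return 0;
  }
  uint64_t compact_bytes = file_sizes[0];
  // Work per deleted file for the span chosen so far, [0, span_len).
  uint64_t compact_bytes_per_del_file = std::numeric_limits<uint64_t>::max();
  size_t span_len;
  for (span_len = 1; span_len < file_sizes.size(); ++span_len) {
    compact_bytes += file_sizes[span_len];
    // Work per deleted file if the span grew to [0, span_len + 1).
    uint64_t new_compact_bytes_per_del_file = compact_bytes / span_len;
    if (new_compact_bytes_per_del_file > compact_bytes_per_del_file) {
      break;  // growing the span further makes each deletion costlier
    }
    compact_bytes_per_del_file = new_compact_bytes_per_del_file;
  }
  // Eligibility uses the value for the span actually chosen, not the
  // value for the rejected, one-file-longer span.
  if (span_len >= min_files_to_compact &&
      compact_bytes_per_del_file < max_compact_bytes_per_del_file) {
    return span_len;
  }
  return 0;
}
```

For file sizes 10, 10, 10, 100 the search stops at three files with 15 bytes of work per deleted file; checking the longer span's value (43) against the limit instead would wrongly reject it for any limit between 16 and 43.
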
Closes https://github.com/facebook/rocksdb/pull/3684

Differential Revision: D7530396

Pulled By: ajkr

fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
2018-04-09 13:42:31 -07:00
Zhongyi Xie
f3a1d9e049 fix data race
Summary:
Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
Closes https://github.com/facebook/rocksdb/pull/3691

Differential Revision: D7541400

Pulled By: miasantreble

fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb
2018-04-09 12:28:28 -07:00
Maysam Yabandeh
bde1c1a72a WritePrepared Txn: add stats
Summary:
Adding some stats that would be helpful to monitor whether the DB has entered unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
2018-04-07 21:56:42 -07:00
Gihwan Oh
74767deec3 Fix typo
Summary:
regrad -> regard
Closes https://github.com/facebook/rocksdb/pull/3685

Differential Revision: D7540952

Pulled By: miasantreble

fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad
2018-04-06 15:42:50 -07:00
Phani Shekhar Mantripragada
446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
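
The fallback order described above can be sketched as follows. `PathSketch` and `EffectivePaths` are hypothetical names, and giving the `db_name` fallback an unlimited target size is an assumption made for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in for a (path, target size) pair.
struct PathSketch {
  std::string path;
  uint64_t target_size;
};

// cf_paths wins if set; otherwise db_paths; otherwise db_name alone.
inline std::vector<PathSketch> EffectivePaths(
    const std::vector<PathSketch>& cf_paths,
    const std::vector<PathSketch>& db_paths, const std::string& db_name) {
  if (!cf_paths.empty()) {
    return cf_paths;
  }
  if (!db_paths.empty()) {
    return db_paths;
  }
  // Assumed: the single fallback path has no size limit.
  return {PathSketch{db_name, std::numeric_limits<uint64_t>::max()}};
}
```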

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Dmitri Smirnov
147dfc7bdf Fix pre_release callback argument list.
Summary:
Constness of primitive by-value parameters does not affect the signature of the
  method and has no influence on whether the overriding method
  actually declares const bool instead of just bool. In addition,
  it is rarely useful but does produce compatibility warnings
  in the VS 2015 compiler.
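
The language rule behind this can be checked directly. `CallbackBase` and `CallbackImpl` are hypothetical classes, not the pre-release callback interface touched by the patch.

```cpp
#include <type_traits>

// Top-level const on a by-value parameter is not part of the function
// type, so these two types are identical.
static_assert(std::is_same<void(const bool), void(bool)>::value,
              "top-level const is dropped from the function type");

struct CallbackBase {
  virtual ~CallbackBase() {}
  virtual int Callback(bool disable_memtable) = 0;
};

struct CallbackImpl : CallbackBase {
  // `const bool` still overrides the base method; the const only makes
  // the local copy read-only inside this definition.
  int Callback(const bool disable_memtable) override {
    return disable_memtable ? 1 : 0;
  }
};
```

Since the qualifier changes nothing for callers or overriders, dropping it from the interface declaration is purely a cleanup.
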
Closes https://github.com/facebook/rocksdb/pull/3663

Differential Revision: D7475739

Pulled By: ajkr

fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
2018-04-05 11:12:16 -07:00
Zhongyi Xie
c827b2dc2a fix build for rocksdb lite
Summary:
currently rocksdb lite build fails due to the following errors:
> db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                                                   ^
db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                ^
db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
 class FlushedFileCollector : public EventListener {
       ^
db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
     flushed_files_.push_back(info.file_path);
                                   ^
cc1plus: all warnings being treated as errors
make: *** [db/db_sst_test.o] Error 1
Closes https://github.com/facebook/rocksdb/pull/3676

Differential Revision: D7493006

Pulled By: miasantreble

fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
2018-04-05 09:11:36 -07:00
Sagar Vemuri
7d9067991e Ttl-triggered and snapshot-release-triggered compactions should not be manual compactions
Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678

Differential Revision: D7498151

Pulled By: sagar0

fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d
2018-04-05 06:41:52 -07:00
Sagar Vemuri
04c11b867d Level Compaction with TTL
Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
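
A minimal sketch of the selection rule, assuming a hypothetical `FileAgeSketch` record and `FilesExpiredByTtl` helper; treating `ttl == 0` as "disabled" is an assumption made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for per-file metadata: file number plus creation time in
// seconds since some epoch.
struct FileAgeSketch {
  uint64_t file_number;
  uint64_t creation_time;
};

// Returns the numbers of files whose age exceeds ttl and which would
// therefore be candidates for a TTL-triggered compaction.
inline std::vector<uint64_t> FilesExpiredByTtl(
    const std::vector<FileAgeSketch>& files, uint64_t now, uint64_t ttl) {
  std::vector<uint64_t> expired;
  if (ttl == 0) {
    return expired;  // assumed: zero disables the feature
  }
  for (const FileAgeSketch& f : files) {
    if (now > f.creation_time && now - f.creation_time > ttl) {
      expired.push_back(f.file_number);
    }
  }
  return expired;
}
```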

This functionality can be controlled by the newly introduced column family option -- ttl.
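
A minimal sketch of the selection rule described above, assuming a simplified per-file record. `FileInfo` and `PickExpiredFiles` are hypothetical names for illustration, not RocksDB types, and treating a ttl of 0 as "disabled" is an assumption of this sketch:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-file metadata.
struct FileInfo {
  uint64_t number;         // file number
  uint64_t creation_time;  // seconds since epoch
};

// Returns the numbers of files whose age exceeds ttl_seconds.
// A ttl of 0 is treated as "feature disabled" (assumption of this sketch).
std::vector<uint64_t> PickExpiredFiles(const std::vector<FileInfo>& files,
                                       uint64_t now, uint64_t ttl_seconds) {
  std::vector<uint64_t> expired;
  if (ttl_seconds == 0) return expired;
  for (const FileInfo& f : files) {
    // Skip files stamped in the future to avoid unsigned underflow.
    if (f.creation_time < now && now - f.creation_time > ttl_seconds) {
      expired.push_back(f.number);
    }
  }
  return expired;
}
```

The files returned here would then be handed to the regular compaction path when no other background work is pending.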

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
2018-04-02 22:14:28 -07:00
Maysam Yabandeh
b225de7e10 WritePrepared Txn: smallest_prepare optimization
Summary:
This is an optimization to reduce lookups in the CommitCache when querying IsInSnapshot. The optimization records the smallest uncommitted data at the time the snapshot was taken; if the sequence number of the read data is lower than that number, the data is assumed to be committed.
To implement this optimization, two changes are required: i) the AddPrepared function must be called sequentially to avoid out-of-order insertion into the PrepareHeap (otherwise the top of the heap would not reliably indicate the smallest prepare in the future), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
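
The fast path can be sketched roughly as follows. `CommitCacheSketch` and `IsInSnapshotSketch` are hypothetical stand-ins, not the actual WritePrepared classes; the `lookups` counter exists only to show when the cache is skipped:

```cpp
#include <cstdint>
#include <unordered_map>

using SequenceNumber = uint64_t;

// Hypothetical stand-in for the commit cache: prepare seq -> commit seq.
struct CommitCacheSketch {
  std::unordered_map<SequenceNumber, SequenceNumber> commits;
  mutable int lookups = 0;  // counts how often the cache is consulted

  bool Lookup(SequenceNumber prep_seq, SequenceNumber* commit_seq) const {
    ++lookups;
    auto it = commits.find(prep_seq);
    if (it == commits.end()) return false;
    *commit_seq = it->second;
    return true;
  }
};

// min_uncommitted: smallest prepared-but-uncommitted sequence number
// recorded when the snapshot was taken.
bool IsInSnapshotSketch(const CommitCacheSketch& cache, SequenceNumber data_seq,
                        SequenceNumber snapshot_seq,
                        SequenceNumber min_uncommitted) {
  if (data_seq > snapshot_seq) return false;    // written after the snapshot
  if (data_seq < min_uncommitted) return true;  // fast path: no cache lookup
  SequenceNumber commit_seq = 0;
  return cache.Lookup(data_seq, &commit_seq) && commit_seq <= snapshot_seq;
}
```

Anything below `min_uncommitted` must already have been committed when the snapshot was taken, which is why the lookup can be skipped for it.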
Closes https://github.com/facebook/rocksdb/pull/3649

Differential Revision: D7388630

Pulled By: maysamyabandeh

fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
2018-04-02 20:27:41 -07:00
Amy Tai
1579626d0d Enable cancelling manual compactions if they hit the sfm size limit
Summary:
Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
Closes https://github.com/facebook/rocksdb/pull/3670

Differential Revision: D7457683

Pulled By: amytai

fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
2018-04-02 19:58:04 -07:00
Zhongyi Xie
44653c7b7a Revert "Avoid adding tombstones of the same file to RangeDelAggregato…
Summary:
…r multiple times"

This reverts commit e80709a33a.

lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads
I'm reverting the commit for now but feel free to submit new patches 😃

To reproduce the regression, you can run the following db_bench command
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

write stats printed by db_bench:

| | P50 | P75 | P99 | P99.9 | P99.99 |
| --- | --- | --- | --- | --- | --- |
| revert commit | 80.77 | 102.94 | 1786.44 | 1892.39 | 2645.10 |
| keep commit | 221.72 | 686.62 | 1842.57 | 1899.70 | 2814.29 |
Closes https://github.com/facebook/rocksdb/pull/3672

Differential Revision: D7463315

Pulled By: miasantreble

fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d
2018-04-02 19:58:04 -07:00
Fosco Marotto
d12112d05e Throw NoSpace instead of IOError when out of space.
Summary:
Replaces #1702 and is updated from feedback.
Closes https://github.com/facebook/rocksdb/pull/3531

Differential Revision: D7457395

Pulled By: gfosco

fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82
2018-03-30 15:27:18 -07:00
Maysam Yabandeh
73f21a7b21 Skip deleted WALs during recovery
Summary:
This patch records the deleted WAL numbers in the manifest so that recovery ignores them and any WAL older than them. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could be caused by, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in not processing the WAL with the commit entry, breaking the 2PC recovery logic.
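
As a rough illustration of the filtering rule, assuming the manifest's record of deleted WALs is available as a plain list of numbers (`WalsToRecover` is a hypothetical helper, not a RocksDB function):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the WAL numbers that recovery should replay, in ascending order:
// every WAL on disk that is newer than the largest deleted WAL number.
std::vector<uint64_t> WalsToRecover(std::vector<uint64_t> wals_on_disk,
                                    const std::vector<uint64_t>& deleted_wals) {
  uint64_t max_deleted = 0;
  for (uint64_t n : deleted_wals) max_deleted = std::max(max_deleted, n);
  std::sort(wals_on_disk.begin(), wals_on_disk.end());
  std::vector<uint64_t> result;
  for (uint64_t n : wals_on_disk) {
    // With no deletion recorded, every WAL is replayed.
    if (deleted_wals.empty() || n > max_deleted) result.push_back(n);
  }
  return result;
}
```

With WALs 3, 5 and 6 on disk and WAL 4 recorded as deleted, only 5 and 6 are replayed, so the stale WAL 3 cannot create a gap.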
Closes https://github.com/facebook/rocksdb/pull/3488

Differential Revision: D6967893

Pulled By: maysamyabandeh

fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
2018-03-30 11:28:05 -07:00
Maysam Yabandeh
89d989ed75 WritePrepared Txn: fix a bug in publishing recoverable state seq
Summary:
When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
Closes https://github.com/facebook/rocksdb/pull/3665

Differential Revision: D7446099

Pulled By: maysamyabandeh

fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b
2018-03-29 14:46:41 -07:00
Maysam Yabandeh
0377ff9dea WritePrepared Txn: make recoverable state visible after flush
Summary:
Currently, if the CommitTimeWriteBatch is set to be used only as state that is required for recovery, the user cannot see it in the DB until the DB is restarted, even though the state is already inserted into the DB after the memtable flush. It would be useful for debugging to make this state visible to the user after the flush by committing it. The patch does so by invoking a callback that performs the commit on the recoverable state.
Closes https://github.com/facebook/rocksdb/pull/3661

Differential Revision: D7424577

Pulled By: maysamyabandeh

fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
2018-03-28 12:12:08 -07:00
Yanqin Jin
1f5def1653 Fix race condition causing double deletion of ssts
Summary:
A possible interleaved execution of the background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and a user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to a race condition in which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may happen to other file types as well, but this PR targets SST files.
Also add a unit test to verify that this PR fixes the issue.

The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving of the `user_thread` and the background compaction thread. They execute as follows:
```
timeline              user_thread                background_compaction thread
t1   |                                          FindObsoleteFiles(full_scan=false)
t2   |     FindObsoleteFiles(full_scan=true)
t3   |                                          PurgeObsoleteFiles
t4   |     PurgeObsoleteFiles
     V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in the RocksDB directory, including the ones that the background compaction thread has already collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles`, because the background compaction thread has already deleted them in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
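
The idea can be sketched as a shared "already grabbed" set. The class below is a hypothetical simplification for illustration; only the member name `files_grabbed_for_purge_` comes from the PR, and the real bookkeeping lives in `DBImpl`:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

class ObsoleteFileTrackerSketch {
 public:
  // Returns the candidates not yet claimed by another caller and claims them,
  // so that no file is handed to two purging threads.
  std::vector<uint64_t> FindObsoleteFiles(
      const std::vector<uint64_t>& candidates) {
    std::lock_guard<std::mutex> lock(mu_);
    std::vector<uint64_t> mine;
    for (uint64_t f : candidates) {
      if (files_grabbed_for_purge_.insert(f).second) mine.push_back(f);
    }
    return mine;
  }

  // Called once the given files have been deleted; releases the claims.
  void PurgeObsoleteFiles(const std::vector<uint64_t>& files) {
    std::lock_guard<std::mutex> lock(mu_);
    for (uint64_t f : files) files_grabbed_for_purge_.erase(f);
  }

 private:
  std::mutex mu_;
  std::unordered_set<uint64_t> files_grabbed_for_purge_;
};
```

In the timeline above, the background thread claims its files at t1, so the full scan at t2 only receives files that nobody else is about to delete.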

ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638

Differential Revision: D7384372

Pulled By: riversand963

fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
2018-03-28 10:29:59 -07:00
Maysam Yabandeh
35a4469bbf Fix race condition via concurrent FlushWAL
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if the two_write_queues_ option is set. The patch fixes the problem by i) skipping log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protecting log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.

Fixes #3599
Closes https://github.com/facebook/rocksdb/pull/3656

Differential Revision: D7405608

Pulled By: maysamyabandeh

fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
2018-03-26 16:29:56 -07:00
Maysam Yabandeh
3e417a6607 WritePrepared Txn: AddPrepared for all sub-batches
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-batch is moved to old_prepared_ but the seq numbers of the later ones are not. The patch fixes this by calling AddPrepared for all sub-batches.
Closes https://github.com/facebook/rocksdb/pull/3651

Differential Revision: D7388635

Pulled By: maysamyabandeh

fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
2018-03-23 17:30:04 -07:00
Dmitri Smirnov
d382ae7de6 Improve perf of random read and insert compare by suggesting inlining to the compiler
Summary:
Results from the 2015 compiler. This improves sequential insert. Random read results are inconclusive, but I hope 2017 will do a better job at inlining.

Before:
fillseq      :       **3.638 micros/op 274866 ops/sec;  213.9 MB/s**

After:
fillseq      :       **3.379 micros/op 295979 ops/sec;  230.3 MB/s**
Closes https://github.com/facebook/rocksdb/pull/3645

Differential Revision: D7382711

Pulled By: siying

fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4
2018-03-23 13:26:55 -07:00
LingBin
e80709a33a Avoid adding tombstones of the same file to RangeDelAggregator multiple times
Summary:
RangeDelAggregator will remember the files whose range tombstones have been added,
so the caller can check whether a file has already been added before calling AddTombstones.

Closes https://github.com/facebook/rocksdb/pull/3635

Differential Revision: D7354604

Pulled By: ajkr

fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107
2018-03-23 12:43:06 -07:00
Radoslaw Zarzynski
09b6bf828a InlineSkiplist: don't decode keys unnecessarily during comparisons
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. It then performs multiple comparisons with it, each of which requires `GetLengthPrefixedSlice()` to be invoked in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.
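
A minimal sketch of the "decode once, compare many times" idea, assuming a varint32 length prefix. All three helpers are hypothetical and use `std::string` copies for brevity, whereas the real code works on slices without copying:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encodes key as a varint32 length followed by the key bytes.
std::string EncodeLengthPrefixed(const std::string& key) {
  std::string out;
  uint32_t n = static_cast<uint32_t>(key.size());
  while (n >= 0x80) {
    out.push_back(static_cast<char>((n & 0x7f) | 0x80));
    n >>= 7;
  }
  out.push_back(static_cast<char>(n));
  out += key;
  return out;
}

// Decodes the varint32 length prefix and returns the key bytes.
std::string DecodeLengthPrefixed(const char* p) {
  uint32_t len = 0;
  int shift = 0;
  while (true) {
    uint32_t byte = static_cast<unsigned char>(*p++);
    len |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
    shift += 7;
  }
  return std::string(p, len);
}

// Counts entries that sort before the probe key. The probe is decoded once,
// outside the loop, instead of once per comparison.
size_t CountLess(const std::vector<std::string>& encoded_entries,
                 const char* encoded_probe) {
  const std::string probe = DecodeLengthPrefixed(encoded_probe);  // hoisted
  size_t count = 0;
  for (const std::string& e : encoded_entries) {
    if (DecodeLengthPrefixed(e.data()) < probe) ++count;
  }
  return count;
}
```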

Rough performance comparison
=====
Big keys, no compression.

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.222 micros/op 236836 ops/sec;   80.4 MB/s
```

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.064 micros/op 246059 ops/sec;   83.5 MB/s
```

TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we have many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516

Differential Revision: D7059300

Pulled By: ajkr

fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
2018-03-23 12:14:30 -07:00
Zhongyi Xie
1cbc96d236 FlushReason improvement
Summary:
Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full"

> $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
$ grep 'flush_reason' /dev/shm/dbbench/LOG
...
2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}

With the fix:
> 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
Closes https://github.com/facebook/rocksdb/pull/3627

Differential Revision: D7328772

Pulled By: miasantreble

fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
2018-03-22 18:42:18 -07:00
Sagar Vemuri
2e3d407778 Fsync after writing global seq number in ExternalSstFileIngestionJob
Summary:
Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect.
Closes https://github.com/facebook/rocksdb/pull/3644

Differential Revision: D7373813

Pulled By: sagar0

fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8
2018-03-22 17:42:56 -07:00
Andrew Kryczka
4d51feab0b Rename function for handling WAL write error
Summary:
It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback.
Closes https://github.com/facebook/rocksdb/pull/3485

Differential Revision: D6955787

Pulled By: ajkr

fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
2018-03-22 15:58:39 -07:00
Maysam Yabandeh
7429b20e39 WritePrepared Txn: fix race condition on publishing seq
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641

Differential Revision: D7361508

Pulled By: maysamyabandeh

fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
2018-03-22 14:43:36 -07:00
QingpingWang
70282cf876 fix behavior does not match name for "IsFileDeletionsEnabled"
Summary:
for PR https://github.com/facebook/rocksdb/pull/3598
I deleted the original repo for some reason. Sorry for the inconvenience.
Closes https://github.com/facebook/rocksdb/pull/3612

Differential Revision: D7291671

Pulled By: ajkr

fbshipit-source-id: 918490ba86b13fe450d232af436cbe259d847c64
2018-03-21 22:13:34 -07:00
QingpingWang
2ce8f63f81 C API for PerfContext
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607

Differential Revision: D7294225

Pulled By: ajkr

fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
2018-03-21 22:13:34 -07:00
Jingguo Yao
8823487ff7 doc: fix a typo
Summary:
s/synchromization/synchronization/
Closes https://github.com/facebook/rocksdb/pull/3583

Differential Revision: D7276596

Pulled By: sagar0

fbshipit-source-id: 552ec6d6935f642e1a3a7c552de6c94441ac50e0
2018-03-21 15:58:58 -07:00
Siying Dong
93d52696bf Memory Problem Of Destroying ColumnFamilyHandle after deleting the CF
Summary:
When destroying a column family handle after the column family has been deleted, the handle may hold shared pointers to some objects in ColumnFamilyOptions, but the destruction order in the destructor may cause some of those objects to be destroyed before they are used by later steps. Fix it by making a copy of the options object and destroying it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610

Differential Revision: D7281025

Pulled By: siying

fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
2018-03-20 17:13:12 -07:00
Andrew Kryczka
d1b26507bd fix db_compaction_test when compression disabled
Summary:
Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in the no-compression use case, and one file in the compression use case. I increased `target_file_size_base` so it generates one file in both use cases.
Closes https://github.com/facebook/rocksdb/pull/3625

Differential Revision: D7311885

Pulled By: ajkr

fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107
2018-03-19 12:30:05 -07:00
zhsj
cc340268e9 fix wrong length in snprintf
Summary: Closes https://github.com/facebook/rocksdb/pull/3622

Differential Revision: D7307689

Pulled By: ajkr

fbshipit-source-id: b8f52effc63fea06c2058b39c60944c2c1f814b4
2018-03-16 13:27:55 -07:00
Huachao Huang
ecfca1ff59 Optimize overlap checking for external file ingestion
Summary:
If there are a lot of overlapped files in L0, creating a merging iterator for
all files in L0 to check overlap can be very slow because we need to read and
seek all files in L0. However, in that case, the ingested file is likely to
overlap with some files in L0, so if we check those files one by one, we can stop
once we encounter overlap.
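
The early-exit check can be sketched as below. This is a simplification under stated assumptions: it compares only the smallest and largest keys of each file in bytewise order, whereas the real check seeks within each file; `KeyRange` and both functions are hypothetical names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical key range of one file, inclusive on both ends.
struct KeyRange {
  std::string smallest;
  std::string largest;
};

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Checks L0 files one by one and stops at the first overlap.
// files_checked reports how many files were examined.
bool IngestedFileOverlapsL0(const std::vector<KeyRange>& l0_files,
                            const KeyRange& ingested, size_t* files_checked) {
  *files_checked = 0;
  for (const KeyRange& f : l0_files) {
    ++*files_checked;
    if (Overlaps(f, ingested)) return true;  // no need to touch the rest
  }
  return false;
}
```

When L0 holds many overlapping files, the loop usually ends after the first few, instead of opening and seeking every file as a merging iterator would.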

Ref: https://github.com/facebook/rocksdb/issues/3540
Closes https://github.com/facebook/rocksdb/pull/3564

Differential Revision: D7196784

Pulled By: anand1976

fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67
2018-03-16 10:43:17 -07:00
Niv Dayan
da82aab126 allowing CompactFiles to return new file names
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608

Differential Revision: D7275789

Pulled By: siying

fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
2018-03-15 11:58:12 -07:00
Dmitri Smirnov
6f7b7f91b5 Optionally create DuplicateDetector
Summary:
Address issue https://github.com/facebook/rocksdb/issues/3579
Closes https://github.com/facebook/rocksdb/pull/3589

Differential Revision: D7221161

Pulled By: yiwu-arbug

fbshipit-source-id: bd875ab0aa0e414dfa98b1bf036ba9b4ed351361
2018-03-14 00:57:25 -07:00
Andrew Kryczka
2256dab135 fix flaky DBSSTTest.DeleteSchedulerMultipleDBPaths
Summary:
I landed #3544 which made this test flaky. The reason was the files scheduled for deletion sometimes went through the trash-marking process, and sometimes were deleted directly. Our counter only bumped on the former code path, so if the latter code path was used, we'd miss counting a file deleted by deletion scheduler. This PR also bumps the counter in the latter code path.
Closes https://github.com/facebook/rocksdb/pull/3593

Differential Revision: D7226173

Pulled By: yiwu-arbug

fbshipit-source-id: 81ab44c60834df6ff88db1d73ea34e26c6e93c39
2018-03-13 14:57:26 -07:00
Amy Tai
e476d0e252 Adding stat to count cancelled compactions
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574

Differential Revision: D7190259

Pulled By: amytai

fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
2018-03-08 10:42:28 -08:00
Bruce Mitchener
a3a3f5497c Fix some typos in comments and docs.
Summary: Closes https://github.com/facebook/rocksdb/pull/3568

Differential Revision: D7170953

Pulled By: siying

fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c
2018-03-08 10:27:25 -08:00
Lukas Rist
a277b0f2b7 Clarification regarding record format
Summary:
The CRC is actually calculated based on the record type and payload.
The wiki should also be updated accordingly and extended with a section on the recyclable record format.
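
To illustrate the point, here is a sketch of a checksum that covers the record type byte followed by the payload. It uses a plain bitwise CRC-32C; any masking of the stored value and the surrounding header layout of the real log format are out of scope, and `RecordChecksum` is a hypothetical helper:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Passing a previous result as crc continues the checksum over more data.
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; ++i) {
    crc ^= static_cast<unsigned char>(data[i]);
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Checksum over the record type byte followed by the payload.
uint32_t RecordChecksum(unsigned char record_type, const std::string& payload) {
  const char t = static_cast<char>(record_type);
  uint32_t crc = Crc32cExtend(0, &t, 1);
  return Crc32cExtend(crc, payload.data(), payload.size());
}
```

Because the type byte is part of the input, two records with identical payloads but different types yield different checksums.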
Closes https://github.com/facebook/rocksdb/pull/3576

Differential Revision: D7196478

Pulled By: siying

fbshipit-source-id: 39f7a0395075cc73e2aa2bfc9e42c85bce35e765
2018-03-08 10:27:25 -08:00
Bruce Mitchener
0de710f5b8 Use nullptr instead of NULL / 0 more consistently.
Summary: Closes https://github.com/facebook/rocksdb/pull/3569

Differential Revision: D7170968

Pulled By: yiwu-arbug

fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348
2018-03-07 12:42:12 -08:00
Stuart
f021f1d9e1 Add rocksdb_open_with_ttl function in C API
Summary:
Change-Id: Ie6f9b10bce459f6bf0ade0e5877264b4e10da3f5
Signed-off-by: Stuart <Stuart.Hu@emc.com>
Closes https://github.com/facebook/rocksdb/pull/3553

Differential Revision: D7144833

Pulled By: sagar0

fbshipit-source-id: 815225fa6e560d8a5bc47ffd0a98118b107ce264
2018-03-06 20:57:20 -08:00
amytai
0a3db28d98 Disallow compactions if there isn't enough free space
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.

Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
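
A rough sketch of the admission check, assuming the compaction output is at most as large as its input. The class is a hypothetical stand-in that borrows the method names mentioned above; it is not SstFileManager's implementation, and treating a limit of 0 as "no limit" is an assumption:

```cpp
#include <cstdint>

class SpaceTrackerSketch {
 public:
  void SetMaxAllowedSpaceUsage(uint64_t bytes) { max_allowed_ = bytes; }
  void SetCompactionBufferSize(uint64_t bytes) { buffer_ = bytes; }
  void SetCurrentUsage(uint64_t bytes) { current_ = bytes; }

  // True if a compaction whose inputs total input_bytes is expected to fit
  // under the limit, leaving buffer_ bytes for other background jobs.
  bool EnoughRoomForCompaction(uint64_t input_bytes) const {
    if (max_allowed_ == 0) return true;  // 0 means "no limit" in this sketch
    return current_ + input_bytes + buffer_ <= max_allowed_;
  }

 private:
  uint64_t max_allowed_ = 0;
  uint64_t buffer_ = 0;
  uint64_t current_ = 0;
};
```

A compaction for which the check returns false would be cancelled rather than started.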
Closes https://github.com/facebook/rocksdb/pull/3449

Differential Revision: D7016941

Pulled By: amytai

fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
2018-03-06 16:27:54 -08:00
Andrew Kryczka
20c508c1ed Enable subcompactions in manual level-based compaction
Summary:
This is the simplest way I could think of to speed up `CompactRange`. It works but isn't that optimal because it relies on the same `max_compaction_bytes` and `max_subcompactions` options that are used in other places. If it turns out to be useful we can allow overriding these in `CompactRangeOptions` in the future.
Closes https://github.com/facebook/rocksdb/pull/3549

Differential Revision: D7117634

Pulled By: ajkr

fbshipit-source-id: d0cd03d6bd0d2fd7ea3fb13cd3b8bf7c47d11e42
2018-03-06 12:43:51 -08:00
Andrew Kryczka
6a3eebbab0 support multiple db_paths in SstFileManager
Summary:
Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion.
Closes https://github.com/facebook/rocksdb/pull/3544

Differential Revision: D7093786

Pulled By: ajkr

fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f
2018-03-06 12:43:51 -08:00
Fosco Marotto
d518fe1da6 uint64_t and size_t changes to compile for iOS
Summary:
In attempting to build a static lib for use in iOS, I ran into lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.

This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.

Also up for discussion: is iOS worth supporting?  Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503

Differential Revision: D7106457

Pulled By: gfosco

fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
2018-03-06 12:43:51 -08:00
Dmitri Smirnov
c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and align unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoint tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only takes into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evicts TTL blob files that are still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to the compaction filter. The compaction filter will then filter inlined keys to evict those with an earlier expiration and a smaller sequence number. This is the so-called LSM FIFO.
* The compaction filter also filters those blob indexes whose corresponding blob file is gone.
* Add an event listener to listen for compaction/flush events and update the sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
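
The size-based part of the eviction policy listed above can be sketched as follows. `BlobFileSketch` and `PickBlobFilesToEvict` are hypothetical names, and the sketch ignores the expiration and sequence number passed to the compaction filter:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified description of a live blob file.
struct BlobFileSketch {
  uint64_t number;
  uint64_t size;
  bool has_ttl;
};

// Evicts TTL blob files, oldest first, until live sst size plus live blob
// size fits under max_db_size. Non-TTL blob files are never evicted.
std::vector<uint64_t> PickBlobFilesToEvict(
    const std::vector<BlobFileSketch>& oldest_first, uint64_t live_sst_size,
    uint64_t max_db_size) {
  uint64_t total = live_sst_size;
  for (const BlobFileSketch& f : oldest_first) total += f.size;
  std::vector<uint64_t> evicted;
  for (const BlobFileSketch& f : oldest_first) {
    if (total <= max_db_size) break;
    if (!f.has_ttl) continue;
    total -= f.size;
    evicted.push_back(f.number);
  }
  return evicted;
}
```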
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Maysam Yabandeh
62277e15c3 WritePrepared Txn: Move DuplicateDetector to util
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567

Differential Revision: D7163268

Pulled By: maysamyabandeh

fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
2018-03-05 23:57:12 -08:00
Huachao Huang
9cb4856dbd Don't need to UpdateFilesByCompactionPri for kCompactionStyleNone
Summary: Closes https://github.com/facebook/rocksdb/pull/3563

Differential Revision: D7154653

Pulled By: ajkr

fbshipit-source-id: 4f32fb1b02451a934504c40be22b07fb1f2deb9c
2018-03-05 17:57:39 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction: once when the memtable returned false (because the key was a duplicate) and once for the 2nd attempt. This would result in a different SubBatch count being measured when the recovered transaction is committing.
- If a cf is flushed during recovery, the memtable is not available to assist in detecting the duplicate key. This could result in not advancing the sequence number when iterating over duplicate keys of a flushed cf, and hence inserting the next key with the wrong sequence number.
- SubBatchCounter would reset the comparator to the default comparator after the first duplicate key. The 2nd duplicate key would hence have gone through the wrong comparator and not been detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Sagar Vemuri
15f55e5e06 Fix TSAN timeout in MergeOperatorPinningTest.Randomized/x test
Summary:
[FB - Internal]
MergeOperatorPinningTest.Randomized/x tests are frequently failing with timeouts when run with tsan, as they are exceeding 10 minute limit for tests. The tests are in turn getting disabled due to frequent failures.
I halved the number of rounds to make the test complete sooner. This reduces the number of testing iterations a little, but it still is much better than totally letting the test be disabled.
Closes https://github.com/facebook/rocksdb/pull/3523

Differential Revision: D7031498

Pulled By: sagar0

fbshipit-source-id: 9a694f2176b235259920a42bf24bca5346f7cff1
2018-03-02 16:27:21 -08:00
Yi Wu
1209b6db5c Blob DB: remove existing garbage collection implementation
Summary:
Red diff to remove the existing implementation of garbage collection. The current approach is a reference-counting kind of approach and requires a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.

CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551

Differential Revision: D7130190

Pulled By: yiwu-arbug

fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
2018-03-02 12:57:23 -08:00
Maysam Yabandeh
d060421c77 Fix a leak in prepared_section_completed_
Summary:
The zeroed entries were not removed from the prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes it by refactoring the code. The new code is more efficient since i) it uses two separate mutexes to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining unique log entries with prepare, which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545

Differential Revision: D7106071

Pulled By: maysamyabandeh

fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
2018-03-01 20:41:56 -08:00
Yi Wu
bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property, which only includes files of the latest version. The existing "rocksdb.total-sst-files-size" includes files from all versions and thus includes files that are obsolete but not yet deleted. I'm going to use this new property to cap the blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
leviathan1995
ec5843dca9 Comment typo
Summary: Closes https://github.com/facebook/rocksdb/pull/3546

Differential Revision: D7111708

Pulled By: ajkr

fbshipit-source-id: 522a4a00eb3e34c73afcb86c1f75cd2e90e7608d
2018-02-28 09:56:45 -08:00
Andrew Kryczka
3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
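
The decision described above can be sketched as a range-overlap test. This is a minimal illustration, not the code in `ColumnFamilyData::RangesOverlapWithMemtables`: `KeyRange`, `RangesOverlap` and `ShouldFlushBeforeCompactRange` are hypothetical names, keys are compared bytewise, and the memtable is summarized by its smallest and largest key, which is coarser than inspecting the memtable's actual contents.

```cpp
#include <string>

// Hypothetical closed key range [smallest, largest], compared bytewise.
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// Two closed ranges overlap unless one ends strictly before the other begins.
bool RangesOverlap(const KeyRange& a, const KeyRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Flush only when the memtable holds keys and its key span intersects the
// range passed to CompactRange. Assumption: the span is a conservative
// summary, so this may flush even if no key falls inside the range.
bool ShouldFlushBeforeCompactRange(bool memtable_empty,
                                   const KeyRange& memtable_span,
                                   const KeyRange& compact_range) {
  return !memtable_empty && RangesOverlap(memtable_span, compact_range);
}
```

With this shape, a `CompactRange` over ["d", "k"] skips the flush when the memtable only holds keys in ["a", "c"], so no extra L0 file is created.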

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Zhongyi Xie
ad05cbb182 DB::Open should fail on tmpfs when use_direct_reads=true
Summary:
Before:

> $ TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
DB path: [/dev/shm/dbbench]
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
db_bench: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument

After:
> TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
open error: Not implemented: Direct I/O is not supported by the specified DB.
Closes https://github.com/facebook/rocksdb/pull/3539

Differential Revision: D7082658

Pulled By: miasantreble

fbshipit-source-id: f9d9c6ec3b5e9e049cab52154940ee101ba4d342
2018-02-26 14:58:06 -08:00
Anand Ananthabhotla
dfbe52e099 Fix the Logger::Close() and DBImpl::Close() design pattern
Summary:
The recent Logger::Close() and DBImpl::Close() implementations rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
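
Why the destructor call does not work, and what the refactoring looks like, can be sketched as follows. This is an illustrative model only: `BaseLogger`, `FileLogger`, `CloseHelper` and `g_trace` are hypothetical names, not the RocksDB classes. In C++, a virtual call made from a base-class destructor resolves to the base version, because the derived part of the object has already been destroyed.

```cpp
#include <string>

// Records which cleanup ran; stands in for real resource release.
static std::string g_trace;

class BaseLogger {
 public:
  virtual ~BaseLogger() {
    // Calling CloseImpl() here would run BaseLogger::CloseImpl(), never the
    // derived override: the derived part is already gone at this point.
  }
  void Close() {
    if (!closed_) {
      closed_ = true;
      CloseImpl();  // normal virtual dispatch, the object is fully alive
    }
  }

 protected:
  virtual void CloseImpl() { g_trace += "base;"; }
  bool closed_ = false;
};

class FileLogger : public BaseLogger {
 public:
  ~FileLogger() override {
    if (!closed_) {
      closed_ = true;
      CloseHelper();  // non-virtual helper, safe to call in a destructor
    }
  }

 protected:
  void CloseImpl() override { CloseHelper(); }

 private:
  // Shared by CloseImpl() and the destructor, as the summary describes.
  void CloseHelper() { g_trace += "file;"; }
};
```

Either path (explicit `Close()` or plain destruction) runs the derived cleanup exactly once in this sketch.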

Reviewed By: gfosco

Differential Revision: D7049303

Pulled By: anand1976

fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
2018-02-23 13:57:26 -08:00
Siying Dong
30649dc6a1 Have a different function when ROCKSDB_JEMALLOC=0
Summary:
Some sanitizers are not happy with the parameter name when ROCKSDB_JEMALLOC is not set. Use another function instead.
Closes https://github.com/facebook/rocksdb/pull/3536

Differential Revision: D7064849

Pulled By: siying

fbshipit-source-id: c6ae94e044686176af1259df9172453d52c2f9d5
2018-02-23 11:42:33 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Sagar Vemuri
8ada876dfe Add rocksdb.iterator.internal-key property
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
2018-02-20 19:12:09 -08:00
Maysam Yabandeh
c178da053b WritePrepared Txn: optimizations for sysbench update_noindex
Summary:
These are optimizations that we applied to improve sysbench's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic to the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512

Differential Revision: D7000075

Pulled By: maysamyabandeh

fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
2018-02-16 08:42:31 -08:00
Mike Kolupaev
97307d888f Fix deadlock in ColumnFamilyData::InstallSuperVersion()
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.

This deadlock is hit all the time on our workload. It blocks our release.

In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.

So, probably the only safe way to use ThreadLocalPtr callbacks is to do only simple and lock-free things in them.

This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.

I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510

Reviewed By: sagar0

Differential Revision: D7005346

Pulled By: al13n321

fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
2018-02-16 08:13:34 -08:00
Maysam Yabandeh
8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
jsteemann
4e7a182d09 Several small "fixes"
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplified an assertion
- moved non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
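
The underflow mentioned in the right-trimming item is a common pitfall with unsigned indices. A minimal sketch of an underflow-free right trim follows; `RightTrim` is a hypothetical stand-in and is not the code in string_util.cc.

```cpp
#include <cstddef>
#include <string>

// Removes trailing spaces and tabs. `end` is compared before it is
// decremented, so it never wraps below zero. A loop that starts at
// size() - 1 with an unsigned index would wrap around on an empty or
// all-whitespace string.
std::string RightTrim(const std::string& s) {
  std::size_t end = s.size();
  while (end > 0 && (s[end - 1] == ' ' || s[end - 1] == '\t')) {
    --end;
  }
  return s.substr(0, end);
}
```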

Differential Revision: D7004679

Pulled By: sagar0

fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
2018-02-15 16:57:37 -08:00
Zhongyi Xie
c88c57cde1 Tweak external file ingestion seqno logic under universal compaction
Summary:
Right now it is possible that a file gets assigned to L0 but is also assigned the seqno from a higher level which it doesn't fit.
Under the current impl, it is possible that seqno in lower levels (Ln) can be equal to smallest seqno of higher levels (Ln-1), which is undesirable from universal compaction's point of view.
This should fix the intermittent failure of `ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno`
Closes https://github.com/facebook/rocksdb/pull/3411

Differential Revision: D6813802

Pulled By: miasantreble

fbshipit-source-id: 693d0462fa94725ccfb9d8858743e6d2d9992d14
2018-02-15 14:13:39 -08:00
Fosco Marotto
ba6ee1f749 Fix 2 more unused reference errors VS2017
Summary:
As in #3425
Closes https://github.com/facebook/rocksdb/pull/3497

Differential Revision: D6979588

Pulled By: gfosco

fbshipit-source-id: e9fb32d04ad45575dfe9de1d79348d158e474197
2018-02-14 11:12:36 -08:00
Igor Sugak
d08d05cb62 fix UBSAN errors in fault_injection_test
Summary:
This fixes shift and signed-integer-overflow UBSAN checks in fault_injection_test by using a larger and unsigned type.
Closes https://github.com/facebook/rocksdb/pull/3498
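
The kind of fix described above can be illustrated generically. This sketch does not reproduce the code in fault_injection_test; `BitAt` and `Mix` are hypothetical helpers showing why a wider unsigned type silences both checks.

```cpp
#include <cstdint>

// Shifting a 32-bit int by 32 or more bits is undefined behavior. Widening
// to a 64-bit unsigned value first makes every shift in [0, 63] well defined.
std::uint64_t BitAt(unsigned n) { return std::uint64_t{1} << n; }

// Unsigned arithmetic wraps modulo 2^64 instead of overflowing, so a
// multiplicative mixing step like this is well defined for any input.
std::uint64_t Mix(std::uint64_t h, std::uint64_t v) {
  return h * 1099511628211ULL + v;
}
```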

Reviewed By: siying

Differential Revision: D6981116

Pulled By: igorsugak

fbshipit-source-id: 3688f62cce570534b161e9b5f42109ebc9ae5a2c
2018-02-13 14:12:40 -08:00
Siying Dong
dadf01672a Rename one of the two LevelIterator
Summary:
A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes with the same name anyway.
Closes https://github.com/facebook/rocksdb/pull/3499

Differential Revision: D6979325

Pulled By: siying

fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb
2018-02-13 13:57:58 -08:00
Siying Dong
b555ed30a4 Customized BlockBasedTableIterator and LevelIterator
Summary:
Use a customized BlockBasedTableIterator and LevelIterator to replace the current implementations leveraging two-level-iterator. Hopefully the customized logic will make the code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoids the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function calls to the dummy iterator iterating the file metadata. It also enables further optimization.

The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forward iterator is slightly optimized to ensure we trim those iterators.

The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406

Differential Revision: D6809041

Pulled By: siying

fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
2018-02-12 17:12:25 -08:00
Andrew Kryczka
ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
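
The "predict the stall before flushing" idea from the list above can be sketched like this. It is a simplified model under stated assumptions: only two stop conditions are considered, the struct and function names are hypothetical, and the thresholds are named after the `max_write_buffer_number` and `level0_stop_writes_trigger` options. The real `GetWriteStallConditionAndCause` covers more conditions.

```cpp
// Hypothetical snapshot of the state that the stall check looks at.
struct StallInputs {
  int unflushed_memtables;
  int l0_files;
};

// Hypothetical limits, named after the corresponding RocksDB options.
struct StallLimits {
  int max_write_buffer_number;
  int level0_stop_writes_trigger;
};

// Pure check, decoupled from any stats or controller updates.
bool WouldStall(const StallInputs& in, const StallLimits& lim) {
  return in.unflushed_memtables >= lim.max_write_buffer_number ||
         in.l0_files >= lim.level0_stop_writes_trigger;
}

// Prediction used before a manual flush: would one additional immutable
// memtable or one additional L0 file push the DB into a stall?
bool FlushWouldStall(StallInputs in, const StallLimits& lim) {
  in.unflushed_memtables += 1;
  in.l0_files += 1;
  return WouldStall(in, lim);
}
```

A caller in the spirit of `CompactRange` would wait on a condition variable while `FlushWouldStall` returns true and re-check whenever a flush finishes, a compaction finishes, or options change.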

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
Zhongyi Xie
3f1bb07351 make flush_reason_ atomic to keep TSAN happy
Summary: Closes https://github.com/facebook/rocksdb/pull/3487

Differential Revision: D6967098

Pulled By: miasantreble

fbshipit-source-id: 48e0accf2e3b3f589ddb797ff8083c8520269bf0
2018-02-12 13:28:18 -08:00
Siying Dong
ef29d2a234 Explicitly fail writes if key or value is not smaller than 4GB
Summary:
Right now, users will encounter unexpected behavior if they use a key or value larger than 4GB. We should explicitly fail the queries.
Closes https://github.com/facebook/rocksdb/pull/3484
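
A size guard of this kind can be sketched as below. The names `WriteCheck` and `CheckKeyValueSize` are hypothetical, and the limit is assumed here to be the largest value of a 32-bit unsigned integer, so that any size of 4GB or more is rejected.

```cpp
#include <cstdint>
#include <limits>

enum class WriteCheck { kOk, kKeyTooLarge, kValueTooLarge };

// Rejects keys or values whose size is 2^32 bytes or more (assumed limit).
WriteCheck CheckKeyValueSize(std::uint64_t key_size,
                             std::uint64_t value_size) {
  constexpr std::uint64_t kMax = std::numeric_limits<std::uint32_t>::max();
  if (key_size > kMax) return WriteCheck::kKeyTooLarge;
  if (value_size > kMax) return WriteCheck::kValueTooLarge;
  return WriteCheck::kOk;
}
```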

Differential Revision: D6953895

Pulled By: siying

fbshipit-source-id: b60491e1af064fc5d52971956661f6c18ceac24f
2018-02-09 14:57:54 -08:00
Yi Wu
fe228da0a9 WritePrepared Txn: Support merge operator
Summary:
CompactionIterator invokes MergeHelper::MergeUntil() to do partial merge between snapshot boundaries. Previously it only depended on the sequence number to tell the snapshot boundary, but we also need to make use of snapshot_checker to verify visibility of the merge operands to the snapshots. For example, say there is a snapshot with seq = 2 that can only see data with seq <= 1. There are three merges, with seq = 1, 2, 3 respectively. A correct compaction output would be (1),(2+3). Without taking snapshot_checker into account when generating the merge result, compaction will generate output (1+2),(3).

By filtering uncommitted keys with read callback, the read path already takes care of merges well and doesn't need additional updates.
Closes https://github.com/facebook/rocksdb/pull/3475
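
The example in the summary can be reproduced with a small model. This is a sketch, not the CompactionIterator logic: it assumes a single snapshot, operands listed in ascending sequence order, and a hypothetical `Visible` predicate standing in for the snapshot checker. A new group starts whenever visibility to the snapshot changes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using SeqNo = std::uint64_t;
// Returns true if the snapshot can see data written at `seq`.
using Visible = std::function<bool(SeqNo)>;

// Groups merge operands into runs that may be merged together and renders
// them in the "(1),(2+3)" notation used in the summary.
std::string GroupOperands(const std::vector<SeqNo>& seqs,
                          const Visible& visible) {
  std::string out;
  for (std::size_t i = 0; i < seqs.size(); ++i) {
    const bool boundary =
        i == 0 || visible(seqs[i]) != visible(seqs[i - 1]);
    if (boundary) {
      out += (i == 0 ? "(" : "),(");
    } else {
      out += "+";
    }
    out += std::to_string(seqs[i]);
  }
  if (!seqs.empty()) out += ")";
  return out;
}
```

Passing a predicate that accepts `seq <= 1` models the snapshot checker, while a predicate that accepts `seq <= 2` models the old comparison on sequence numbers alone.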

Differential Revision: D6926087

Pulled By: yiwu-arbug

fbshipit-source-id: 8f539d6f897cfe29b6dc27a8992f68c2a629d40a
2018-02-09 14:57:54 -08:00
Zhongyi Xie
945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why a flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Siying Dong
821e0b1683 Disable options_settable_test in UBSAN and fix UBSAN failure in blob_…
Summary:
…db_test

options_settable_test won't pass UBSAN so disable it.
blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though.
Closes https://github.com/facebook/rocksdb/pull/3477

Differential Revision: D6928681

Pulled By: siying

fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122
2018-02-07 14:42:26 -08:00
Yi Wu
81736d8afe WritePrepared Txn: update compaction_iterator_test and db_iterator_test
Summary:
Update compaction_iterator_test with write-prepared transaction DB related tests. Transaction related tests are grouped in CompactionIteratorWithSnapshotCheckerTest. The existing tests are duplicated to make them also test with a dummy SnapshotChecker that will say every key is visible to every snapshot (this is okay, we still compare sequence numbers to verify visibility). Merge related tests are disabled and will be revisited in another PR.

Existing db_iterator_tests are also duplicated to test with a dummy read_callback that will say every key is committed.
Closes https://github.com/facebook/rocksdb/pull/3466

Differential Revision: D6909253

Pulled By: yiwu-arbug

fbshipit-source-id: 2ae4656b843a55e2e9ff8beecf21f2832f96cd25
2018-02-06 14:12:13 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of the memtable being able to detect a duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects the existence of at least one duplicate key in the write batch. If a duplicate key was reported, it then pays the cost of counting the number of sub-batches by iterating over the write batch and passes it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time the memtable reports a duplicate (a sub-batch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
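
The sub-batch counting step can be sketched as follows. This is a simplified model with a hypothetical `CountSubBatches` function: it treats the batch as a flat list of user keys compared bytewise and ignores column families and custom comparators, which the real counter has to handle.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A new sub-batch begins whenever a key repeats within the current one, so
// every sub-batch holds distinct keys and can share one sequence number.
// In this sketch an empty batch still counts as one sub-batch.
std::size_t CountSubBatches(const std::vector<std::string>& keys) {
  std::size_t batches = 1;
  std::set<std::string> seen;
  for (const auto& key : keys) {
    if (!seen.insert(key).second) {
      // Duplicate within the current sub-batch: start the next one with it.
      ++batches;
      seen.clear();
      seen.insert(key);
    }
  }
  return batches;
}
```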

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Anand Ananthabhotla
4b124fb9d3 Handle error return from WriteBuffer()
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404

Differential Revision: D6879954

Pulled By: anand1976

fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
2018-02-05 13:59:34 -08:00
Mike Kolupaev
cb5b8f2090 Fix use-after-free in tailing iterator with merge operator
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415

Differential Revision: D6817414

Pulled By: al13n321

fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
2018-02-02 21:26:28 -08:00
Tamir Duberstein
cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Fosco Marotto
ba8aa8fdc8 Upgrade Appveyor to VS2017
Summary:
Per some discussions, this will switch our Appveyor testing to use Visual Studio 2017.
Closes https://github.com/facebook/rocksdb/pull/3445

Differential Revision: D6874918

Pulled By: gfosco

fbshipit-source-id: c5a0032ca9f37f0d3baeae35c59d850d528c3176
2018-02-01 13:57:01 -08:00
Maysam Yabandeh
813719e952 WritePrepared Txn: Duplicate Keys, Memtable part
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418

Differential Revision: D6822412

Pulled By: maysamyabandeh

fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
2018-01-31 18:57:07 -08:00
Fosco Marotto
5400800a56 Work around VS2017 warning for unused reference
Summary:
For #3407
Closes https://github.com/facebook/rocksdb/pull/3425

Differential Revision: D6836900

Pulled By: gfosco

fbshipit-source-id: 7bcaf7a1beeeeabb7c05584f2745e7b4a2473497
2018-01-31 11:58:10 -08:00
Andrew Kryczka
ab5ab36ac2 fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check
Summary:
Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/".
Closes https://github.com/facebook/rocksdb/pull/3440

Differential Revision: D6848213

Pulled By: ajkr

fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589
2018-01-30 16:50:42 -08:00
Huachao Huang
ab43ff58b5 Delete files in multiple ranges at once
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.

This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.

Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431

Differential Revision: D6849228

Pulled By: ajkr

fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
2018-01-30 13:56:39 -08:00
Yi Wu
4bdf06e78f Fix DBFlushTest::ManualFlushWithMinWriteBufferNumberToMerge deadlock
Summary:
In the test, there can be a deadlock between the background flush thread and the foreground main thread, as follows:
* background flush thread:
  - holding db mutex, while
  - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point.
* foreground thread:
  - waiting for db mutex to write "key2"

Fix it by letting the background flush thread wait without holding the db mutex.
Closes https://github.com/facebook/rocksdb/pull/3436

Differential Revision: D6841334

Pulled By: yiwu-arbug

fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
2018-01-29 18:56:47 -08:00
Sagar Vemuri
e6605e5302 Tests for dynamic universal compaction options
Summary:
Added a test for three dynamic universal compaction options, in the realm of read amplification:
- size_ratio
- min_merge_width
- max_merge_width

Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason.
Found a bug in compaction reason setting while working on this PR, and fixed in #3412 .

TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move.
Closes https://github.com/facebook/rocksdb/pull/3419

Differential Revision: D6822217

Pulled By: sagar0

fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
2018-01-29 16:42:45 -08:00
Zhongyi Xie
3fe0937180 Use block cache to track memory usage when ReadOptions.fill_cache=false
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put the data blocks it reads into the block cache.

The memory used by the data block is, however, not trackable by users.

To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes https://github.com/facebook/rocksdb/pull/3333

Differential Revision: D6670230

Pulled By: miasantreble

fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
2018-01-29 14:43:10 -08:00
Mark Isaacson
b8eb32f8cf Suppress lint in old files
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.

Reviewed By: yiwu-arbug

Differential Revision: D6821806

fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
2018-01-29 12:56:42 -08:00
Sagar Vemuri
7fcc1d0ddf Incorrect Universal Compaction reason
Summary:
While writing tests for dynamic Universal Compaction options, I found that the compaction reasons we set for size-ratio based and sorted-run based universal compactions are swapped with each other. Fixed it.
Closes https://github.com/facebook/rocksdb/pull/3412

Differential Revision: D6820540

Pulled By: sagar0

fbshipit-source-id: 270a188968ba25b2c96a8339904416c4c87ff5b3
2018-01-26 11:12:40 -08:00
Yi Wu
c7226428dd WritePrepared Txn: Fix DBIterator and add test
Summary:
In DBIter, Prev() calls FindValueForCurrentKey() to search for the current value backward. If it finds that too many stale values are being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with the snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value.
Closes https://github.com/facebook/rocksdb/pull/3382

Differential Revision: D6756148

Pulled By: yiwu-arbug

fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be
2018-01-23 16:57:11 -08:00
Yi Wu
d46e832e94 Assert last reference before destroy ColumnFamilyData
Summary:
In the ColumnFamilySet destructor, assert that it holds the last reference to each cfd before destroying it.

Closes #3112
Closes https://github.com/facebook/rocksdb/pull/3397

Differential Revision: D6777967

Pulled By: yiwu-arbug

fbshipit-source-id: 60b19070e0c194b3b6146699140c1d68777866cb
2018-01-23 15:12:28 -08:00
Yi Wu
edc258127e DB::DumpSupportInfo should log all supported compression types
Summary:
DB::DumpSupportInfo should log all supported compression types.
Closes #3146
Closes https://github.com/facebook/rocksdb/pull/3396

Differential Revision: D6777019

Pulled By: yiwu-arbug

fbshipit-source-id: 5b17f1ffb2d71224e52f7d9c045434746c789fb0
2018-01-23 14:44:12 -08:00
Nathan VanBenschoten
ec0167eecb Fix WriteBatch rep_ format for RangeDeletion records
Summary:
This is a small amount of general cleanup I made while experimenting with https://github.com/facebook/rocksdb/issues/3391.
Closes https://github.com/facebook/rocksdb/pull/3392

Differential Revision: D6788365

Pulled By: yiwu-arbug

fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
2018-01-23 12:57:32 -08:00
Siying Dong
7291a3f813 Improve fallocate size in compaction output
Summary:
Currently, in leveled compaction, we preallocate based solely on the output target file size. If the total input size is smaller than that number, we should use the total input size instead. Also, cap the allocation size at 1GB.
Closes https://github.com/facebook/rocksdb/pull/3385
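
The sizing rule above amounts to taking the minimum of three values. A minimal sketch, assuming a hypothetical `PreallocationSize` helper and ignoring any slack factor the real code may add on top:

```cpp
#include <algorithm>
#include <cstdint>

// Preallocate no more than the target file size, the total compaction input
// size, or 1GB, whichever is smallest.
std::uint64_t PreallocationSize(std::uint64_t target_file_size,
                                std::uint64_t total_input_size) {
  constexpr std::uint64_t kCap = 1ULL << 30;  // 1GB
  return std::min({target_file_size, total_input_size, kCap});
}
```

For a 64MB target and 10MB of input, this yields 10MB rather than 64MB, which avoids reserving space the output file cannot use.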

Differential Revision: D6762363

Pulled By: siying

fbshipit-source-id: e30906f6e9bff3ec847d2166e44cb49c92f98a13
2018-01-22 16:43:46 -08:00
Islam AbdelRahman
c615689bb5 Support skipping bloom filters for SstFileWriter
Summary:
Add an option for SstFileWriter to skip building bloom filters
Closes https://github.com/facebook/rocksdb/pull/3360

Differential Revision: D6709120

Pulled By: IslamAbdelRahman

fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809
2018-01-22 14:42:18 -08:00
Bernard Spil
6f5ba0bf5b Fix building on FreeBSD
Summary:
FreeBSD uses jemalloc as the base malloc implementation.
The patch has been functional on FreeBSD as of the MariaDB 10.2 port.
Closes https://github.com/facebook/rocksdb/pull/3386

Differential Revision: D6765742

Pulled By: yiwu-arbug

fbshipit-source-id: d55dbc082eecf640ef3df9a21f26064ebe6587e8
2018-01-19 17:12:43 -08:00
Yi Wu
5568aec421 Fix DBTest::SoftLimit TSAN failure
Summary:
Fix data race found by TSAN around WriteStallListener: https://gist.github.com/yiwu-arbug/027d2448b903648f2f0f40b05258d80f
Closes https://github.com/facebook/rocksdb/pull/3384

Differential Revision: D6762167

Pulled By: yiwu-arbug

fbshipit-source-id: cd3a5c9f806de390bd1af6077ea6dbbc8bcaec09
2018-01-19 12:57:15 -08:00
Yi Wu
f1cb83fcf4 Fix Flush() keep waiting after flush finish
Summary:
A Flush() call could wait indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User calls Flush() with flush_options.wait = true
2. The manual flush starts in the background
3. A new memtable becomes immutable because of writes. The new memtable will not trigger a flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finishes.

Because the new memtable created at step 3 is not flushed, the previous logic of WaitForFlushMemTable() keeps waiting, even though the memtables it intended to flush have been flushed.

Here, instead of only checking if there are any more memtables to flush, WaitForFlushMemTable() also checks the id of the earliest memtable. If the id is larger than that of the latest memtable at the time the flush was initiated, it means all the memtables that existed when the flush started have been flushed.
Closes https://github.com/facebook/rocksdb/pull/3378
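
The new wait condition can be sketched as a predicate over memtable ids. This is an illustrative model, not the body of WaitForFlushMemTable(): `ManualFlushDone` and its parameters are hypothetical, and the pending memtables are represented simply as a queue of ids, oldest first.

```cpp
#include <cstdint>
#include <deque>

// `unflushed_ids` holds the ids of immutable memtables not yet flushed,
// oldest first. `latest_id_at_flush_start` is the id of the newest memtable
// that existed when the manual flush was requested.
bool ManualFlushDone(const std::deque<std::uint64_t>& unflushed_ids,
                     std::uint64_t latest_id_at_flush_start) {
  // Done when nothing is pending, or when the earliest pending memtable was
  // created after the flush was requested (step 3 in the sequence above).
  return unflushed_ids.empty() ||
         unflushed_ids.front() > latest_id_at_flush_start;
}
```

With the old condition (wait until the queue is empty), the memtable from step 3 would keep the caller blocked. With this predicate the caller returns as soon as every older memtable is gone.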

Differential Revision: D6746789

Pulled By: yiwu-arbug

fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
2018-01-18 17:45:16 -08:00
topilski
b9873162f0 Fixed get version on windows, moved throwing exceptions into cc file.
Summary:
Fixes for msys2 and mingw, hide exceptions into cpp  file.
Closes https://github.com/facebook/rocksdb/pull/3377

Differential Revision: D6746707

Pulled By: yiwu-arbug

fbshipit-source-id: 456b38df80bc48b8386a2cf87f669b5a4f9999a4
2018-01-18 14:56:56 -08:00
Andrew Kryczka
46e599fc6b fix live WALs purged while file deletions disabled
Summary:
When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.

We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of the DB and archive directories, which nobody can delete (except for the deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
Closes https://github.com/facebook/rocksdb/pull/3341

Differential Revision: D6681131

Pulled By: ajkr

fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
2018-01-17 17:42:04 -08:00
Andrew Kryczka
266d85fbec fix DBTest.AutomaticConflictsWithManualCompaction
Summary:
After af92d4ad11, only exclusive manual compaction can have conflict. dc360df81e updated the conflict-checking test case accordingly. But we missed the point that exclusive manual compaction can only conflict with automatic compactions scheduled after it, since it waits on pending automatic compactions before it begins running.

This PR updates the test case to ensure the automatic compactions are scheduled after the manual compaction starts but before it finishes, thus ensuring a conflict. I also cleaned up the test case to use less space as I saw it cause out-of-space error on travis.
Closes https://github.com/facebook/rocksdb/pull/3375

Differential Revision: D6735162

Pulled By: ajkr

fbshipit-source-id: 020530a4e150a4786792dce7cec5d66b420cb884
2018-01-16 23:12:00 -08:00
Yi Wu
dc360df81e Fix multiple build failures
Summary:
* Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure
* Fix DBTest.AutomaticConflictsWithManualCompaction failure introduced by #3366
* Fix BlockBasedTableTest::IndexUncompressed, which should be disabled if snappy is disabled
* Fix ASAN failure with DBBasicTest::DBClose test
Closes https://github.com/facebook/rocksdb/pull/3373

Differential Revision: D6732313

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5
2018-01-16 17:30:39 -08:00
Sunguck Lee
af92d4ad11 Avoid too frequent MaybeScheduleFlushOrCompaction() call
Summary:
If there's a manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this causes MaybeScheduleFlushOrCompaction() to be called too frequently.

https://github.com/facebook/rocksdb/issues/3198
Closes https://github.com/facebook/rocksdb/pull/3366

Differential Revision: D6729575

Pulled By: ajkr

fbshipit-source-id: 96da04f8fd33297b1ccaec3badd9090403da29b0
2018-01-16 13:12:12 -08:00
Anand Ananthabhotla
d0f1b49ab6 Add a Close() method to DB to return status when closing a db
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
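
The idea can be sketched in isolation. This is a minimal illustration, not the RocksDB implementation: `Status`, `FakeLogger`, and `FakeDB` are hypothetical stand-ins, and only the point that an explicit `Close()` surfaces a status the destructor would otherwise swallow is taken from the summary above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a status type; not rocksdb::Status.
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return Status{true, ""}; }
  static Status IOError(const std::string& m) { return Status{false, m}; }
};

// Hypothetical logger whose close can fail (e.g. on a remote file system).
class FakeLogger {
 public:
  explicit FakeLogger(bool fail_on_close) : fail_on_close_(fail_on_close) {}
  Status Close() {
    closed_ = true;
    return fail_on_close_ ? Status::IOError("failed to close log file")
                          : Status::OK();
  }
  bool closed() const { return closed_; }

 private:
  bool fail_on_close_;
  bool closed_ = false;
};

// Hypothetical DB wrapper that does not own the logger.
class FakeDB {
 public:
  explicit FakeDB(FakeLogger* logger) : logger_(logger) {}
  // A destructor cannot return a status, so any error here is lost.
  ~FakeDB() {
    if (!closed_) Close();
  }
  // Explicit close: the caller gets to see the status.
  Status Close() {
    closed_ = true;
    return logger_->Close();
  }

 private:
  FakeLogger* logger_;
  bool closed_ = false;
};
```

A caller that cares about the outcome calls `Close()` and checks the result before destroying the object; a caller that does not still gets a best-effort close from the destructor.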
Closes https://github.com/facebook/rocksdb/pull/3348

Differential Revision: D6698158

Pulled By: anand1976

fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
2018-01-16 11:08:57 -08:00
Andrew Kryczka
43549c7d59 Prevent unnecessary calls to PurgeObsoleteFiles
Summary:
Split `JobContext::HaveSomethingToDelete` into two functions: itself and `JobContext::HaveSomethingToClean`. Now we won't call `DBImpl::PurgeObsoleteFiles` in cases where we really just need to call `JobContext::Clean`. The change is needed because I want to track pending calls to `PurgeObsoleteFiles` for a bug fix, which is much simpler if we only call it after `FindObsoleteFiles` finds files to delete.
Closes https://github.com/facebook/rocksdb/pull/3350

Differential Revision: D6690609

Pulled By: ajkr

fbshipit-source-id: 61502e7469288afe16a663a1b7df345baeaf246f
2018-01-12 13:27:08 -08:00
Andrew Kryczka
ba295cda29 replace DBTest.HugeNumbersOfLevel with a more targeted test case
Summary:
This test often causes out-of-space errors when run on travis. We don't want such stress tests in our unit test suite.

The bug in #596, which this test intends to expose, can be repro'd as long as the bottommost level(s) are empty when CompactRange is called. I rewrote the test to cover this simple case without writing a lot of data.
Closes https://github.com/facebook/rocksdb/pull/3362

Differential Revision: D6710417

Pulled By: ajkr

fbshipit-source-id: 9a1ec85e738c813ac2fee29f1d5302065ecb54c5
2018-01-12 11:12:09 -08:00
Changli Gao
0a7ba0e548 Fix memleak when DB::DeleteFile()
Summary:
Because the corresponding read_first_record_cache_ item wasn't
erased, memory leaked.
Closes https://github.com/facebook/rocksdb/pull/1712

Differential Revision: D4363654

Pulled By: ajkr

fbshipit-source-id: 7da1adcfc8c380e4ffe05b8769fc2221ad17a225
2018-01-11 18:57:33 -08:00
Bo Liu
204af1eccc add WriteBatch::WriteBatch(std::string&&)
Summary:
to save a string copy for some use cases.

The change is pretty straightforward, please feel free to let me know if you want to suggest any tests for it.
Closes https://github.com/facebook/rocksdb/pull/3349

Differential Revision: D6706828

Pulled By: yiwu-arbug

fbshipit-source-id: 873ce4442937bdc030b395c7f99228eda7f59eb7
2018-01-11 15:43:56 -08:00
Andrew Kryczka
0c6e8be9e2 Fix directory name for db_basic_test
Summary:
It was using the same directory as `db_options_test` so transiently failed when unit tests were run in parallel.
Closes https://github.com/facebook/rocksdb/pull/3352

Differential Revision: D6691649

Pulled By: ajkr

fbshipit-source-id: bee433484fec4faedd5cadf2db3c92fdcc99a170
2018-01-10 15:41:46 -08:00
Siying Dong
6aa95f4d0f Fix a wrong log formatting
Summary:
I experienced a weird segfault caused by a type mismatch in log formatting. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3345

Differential Revision: D6687224

Pulled By: siying

fbshipit-source-id: c51fb1c008b7ebc3efdc353a4adad3e8f5b3e9de
2018-01-09 14:58:33 -08:00
Andrew Kryczka
0f0d2ab95a fix DBImpl instance variable naming
Summary:
got confused while reading `FindObsoleteFiles` due to thinking it's a local variable, so renamed it properly
Closes https://github.com/facebook/rocksdb/pull/3342

Differential Revision: D6684797

Pulled By: ajkr

fbshipit-source-id: a4df0aae1cccce99d4dd4d164aadc85b17707132
2018-01-09 12:56:58 -08:00
Chris Lu
24e2c1640d add support for allow_ingest_behind in C API
Summary:
https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files

Need to expose these functions in the C API to be used by Go bindings.
Closes https://github.com/facebook/rocksdb/pull/3011

Differential Revision: D6679563

Pulled By: sagar0

fbshipit-source-id: 536f844ddaeb0172c6d7e416d2a75e8f9e57c8ef
2018-01-08 17:26:31 -08:00
Andrew Kryczka
f00e176c5b fix ForwardIterator reference to temporary object
Summary:
Fixes the following ASAN error:

```
==2108042==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fc50ae9b868 at pc 0x7fc5112aff55 bp 0x7fff9eb9dc10 sp 0x7fff9eb9dc08
=== How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
READ of size 8 at 0x7fc50ae9b868 thread T0
SCARINESS: 23 (8-byte-read-stack-use-after-scope)
     #0 rocksdb/dbformat.h:164                   rocksdb::InternalKeyComparator::user_comparator() const
     #1 librocksdb_src_rocksdb_lib.so+0x1429a7d  rocksdb::RangeDelAggregator::InitRep(std::vector<...> const&)
     #2 librocksdb_src_rocksdb_lib.so+0x142ceae  rocksdb::RangeDelAggregator::AddTombstones(std::unique_ptr<...>)
     #3 librocksdb_src_rocksdb_lib.so+0x1382d88  rocksdb::ForwardIterator::RebuildIterators(bool)
     #4 librocksdb_src_rocksdb_lib.so+0x1382362  rocksdb::ForwardIterator::ForwardIterator(rocksdb::DBImpl*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, rocksdb::SuperVersion*)
     #5 librocksdb_src_rocksdb_lib.so+0x11f433f  rocksdb::DBImpl::NewIterator(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*)
     #6 rocksdb/src/include/rocksdb/db.h:382     rocksdb::DB::NewIterator(rocksdb::ReadOptions const&)
     #7 rocksdb/db_range_del_test.cc:807         rocksdb::DBRangeDelTest_TailingIteratorRangeTombstoneUnsupported_Test::TestBody()
    #18 rocksdb/db_range_del_test.cc:1006        main

Address 0x7fc50ae9b868 is located in stack of thread T0 at offset 104 in frame
     #0 librocksdb_src_rocksdb_lib.so+0x13825af  rocksdb::ForwardIterator::RebuildIterators(bool)
```
Closes https://github.com/facebook/rocksdb/pull/3300

Differential Revision: D6612989

Pulled By: ajkr

fbshipit-source-id: e7ea2ed914c1b80a8a29d71d92440a6bd9cbcc80
2017-12-20 16:12:04 -08:00
Maysam Yabandeh
0ef3fdd732 Disable need_log_sync on bg err
Summary:
When there is a background error, PreprocessWrite returns without marking the logs synced. If we keep need_log_sync set to true, it would try to sync them at the end, which would break the logic. The patch unsets need_log_sync if the logs end up not being marked for sync in PreprocessWrite.
Closes https://github.com/facebook/rocksdb/pull/3293

Differential Revision: D6602347

Pulled By: maysamyabandeh

fbshipit-source-id: 37ee04209e8dcfd78de891654ce50d0954abeb38
2017-12-20 08:12:24 -08:00
Yi Wu
06149429d9 WritePrepared Txn: Return NotSupported on iterator refresh
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require releasing the current snapshot and acquiring another one. Since MyRocks doesn't make use of Iterator::Refresh(), we simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290

Differential Revision: D6599931

Pulled By: yiwu-arbug

fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
2017-12-18 22:29:30 -08:00
Maysam Yabandeh
78c2eedb4f fix release order in validateNumberOfEntries
Summary:
ScopedArenaIterator should be defined after range_del_agg so that it destructs the assigned iterator, which depends on range_del_agg, before range_del_agg itself is destructed.
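
The rule this relies on is that automatic variables are destroyed in reverse order of declaration. A small standalone sketch (the `Tracked` type and `DestructionOrder` function are hypothetical and only borrow the variable names from the summary):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records its name in a shared list when destroyed.
struct Tracked {
  Tracked(std::string n, std::vector<std::string>* e)
      : name(std::move(n)), events(e) {}
  ~Tracked() { events->push_back(name); }
  std::string name;
  std::vector<std::string>* events;
};

// Returns the order in which the two locals were destroyed.
std::vector<std::string> DestructionOrder() {
  std::vector<std::string> events;
  {
    Tracked range_del_agg("range_del_agg", &events);  // declared first
    Tracked iter("iter", &events);                     // declared second
  }  // locals are destroyed in reverse declaration order: iter first
  return events;
}
```

Because `iter` is declared after `range_del_agg`, it is destroyed first and can still safely use the aggregator in its destructor.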
Closes https://github.com/facebook/rocksdb/pull/3281

Differential Revision: D6592332

Pulled By: maysamyabandeh

fbshipit-source-id: 89a15d8ed13d0fc856b0c47dce3d91778738dbac
2017-12-18 14:27:28 -08:00
Anand Ananthabhotla
fccc12f386 Add a histogram stat for memtable flush
Summary:
Add a new histogram stat called rocksdb.db.flush.micros for memtable
flush
Closes https://github.com/facebook/rocksdb/pull/3269

Differential Revision: D6559496

Pulled By: anand1976

fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d
2017-12-15 18:57:00 -08:00
Yi Wu
237b292515 BlobDB: Remove the need to get sequence number per write
Summary:
Previously we stored the sequence number range of each blob file and used it to check whether the file could be visible to a snapshot. But that adds complexity to the code, since the sequence number is only available after a write. (The current implementation gets the sequence number by calling GetLatestSequenceNumber(), which is wrong.) With this patch, we no longer store the sequence number range, and instead check snapshot_sequence < obsolete_sequence to decide whether the file is visible to a snapshot (previously we checked first_sequence <= snapshot_sequence < obsolete_sequence).
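
The two checks can be written side by side. This is only a sketch of the predicates as stated above; the function names and the `SequenceNumber` typedef are illustrative, not the BlobDB code:

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t SequenceNumber;

// Old check: needs the file's first sequence number, which is only known
// after the write has happened.
bool VisibleToSnapshotOld(SequenceNumber first_sequence,
                          SequenceNumber obsolete_sequence,
                          SequenceNumber snapshot_sequence) {
  return first_sequence <= snapshot_sequence &&
         snapshot_sequence < obsolete_sequence;
}

// New check: only the sequence at which the file became obsolete is needed.
bool VisibleToSnapshotNew(SequenceNumber obsolete_sequence,
                          SequenceNumber snapshot_sequence) {
  return snapshot_sequence < obsolete_sequence;
}
```

The new predicate is strictly weaker: every file the old check reported as visible is still reported as visible, and some files written after the snapshot are now reported as visible too.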
Closes https://github.com/facebook/rocksdb/pull/3274

Differential Revision: D6571497

Pulled By: yiwu-arbug

fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
2017-12-15 13:27:30 -08:00
Andrew Kryczka
5a7e08468a fix ThreadStatus for bottom-pri compaction threads
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270

Differential Revision: D6559428

Pulled By: ajkr

fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
2017-12-14 14:57:49 -08:00
Siying Dong
def6a00740 Print out compression type of new SST files in logging
Summary: Closes https://github.com/facebook/rocksdb/pull/3264

Differential Revision: D6552768

Pulled By: siying

fbshipit-source-id: 6303110aff22f341d5cff41f8d2d4f138a53652d
2017-12-14 10:27:43 -08:00
Maysam Yabandeh
546a63272f disableWAL with WriteImplWALOnly
Summary:
Currently WriteImplWALOnly simply returns when disableWAL is set. This is incorrect behavior since it does not allocate the sequence number, which is a side effect of writing to the WAL. This patch fixes the issue.
Closes https://github.com/facebook/rocksdb/pull/3262

Differential Revision: D6550974

Pulled By: maysamyabandeh

fbshipit-source-id: 745a83ae8f04e7ca6c8ffb247d6ef16c287c52e7
2017-12-13 07:57:44 -08:00
Zhongyi Xie
51c2ea0feb Reduce heavy hitter for Get operation
Summary:
This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get`

- rocksdb.block.cache.bytes.write
- rocksdb.block.cache.add
- rocksdb.block.cache.data.miss
- rocksdb.block.cache.data.bytes.insert
- rocksdb.block.cache.data.add
- rocksdb.block.cache.hit
- rocksdb.block.cache.data.hit
- rocksdb.block.cache.bytes.read

The db_bench statistics before and after the change are:

|1GB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:     |4.22%     |1.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.51%     |0.21%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|     	     |0.14%     |0.14%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|

|1MB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:    |3.48%     |1.08%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.80%     |0.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|    	     |0.35%     |0.35%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|
Closes https://github.com/facebook/rocksdb/pull/3172

Differential Revision: D6330532

Pulled By: miasantreble

fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
2017-12-12 21:11:33 -08:00
Islam AbdelRahman
9089373a01 Fix DeleteScheduler::MarkAsTrash() handling existing trash
Summary:
DeleteScheduler::MarkAsTrash() doesn't handle existing .trash files correctly.
This causes rocksdb to be unable to delete existing .trash files on restart.
Closes https://github.com/facebook/rocksdb/pull/3261

Differential Revision: D6548003

Pulled By: IslamAbdelRahman

fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6
2017-12-12 18:17:13 -08:00
Yi Wu
e3a06f12d2 WritePrepared Txn: fix compaction filter snapshot checks
Summary:
Add snapshot_checker check whenever we need to check sequence against snapshots and decide what to do with an input key. The changes are related to one of:
* compaction filter
* single delete
* delete at bottom level
* merge
Closes https://github.com/facebook/rocksdb/pull/3251

Differential Revision: D6537850

Pulled By: yiwu-arbug

fbshipit-source-id: 3faba40ed5e37779f4a0cb7ae78af9546659c7f2
2017-12-12 11:12:24 -08:00
Zhongyi Xie
bb5ed4b1d1 exclude DynamicUniversalCompactionOptions from ROCKSDB_LITE
Summary:
since [SetOptions](https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L494) is not supported in ROCKSDB_LITE
Right now the unit test is broken under lite.
Closes https://github.com/facebook/rocksdb/pull/3253

Differential Revision: D6539428

Pulled By: miasantreble

fbshipit-source-id: 13172b8ecbd75682330726498ea198969bc3e637
2017-12-11 16:28:20 -08:00
Yi Wu
9a27ac5d89 Fix drop column family data race
Summary:
A data race is caught by tsan_crash test between compaction and DropColumnFamily:
https://gist.github.com/yiwu-arbug/5a2b4baae05eeb99ae1719b650f30a44 Compaction checks if the column family has been dropped on each key input, while user can issue DropColumnFamily which updates cfd->dropped_, causing the data race. Fixing it by making cfd->dropped_ an atomic.
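
A stripped-down sketch of the fix, assuming a flag that one thread sets while another polls it. `FakeColumnFamilyData` and `CompactKeys` are hypothetical, and the default (sequentially consistent) memory ordering is used here for simplicity rather than because the patch is known to use it:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the column family data object.
class FakeColumnFamilyData {
 public:
  FakeColumnFamilyData() : dropped_(false) {}
  // Called from the user thread that drops the column family.
  void SetDropped() { dropped_.store(true); }
  // Called from the compaction thread for each input key.
  bool IsDropped() const { return dropped_.load(); }

 private:
  // A plain bool read and written from different threads without
  // synchronization is a data race; std::atomic<bool> makes it well defined.
  std::atomic<bool> dropped_;
};

// Processes up to num_keys keys, stopping as soon as the column family is
// seen as dropped. Returns the number of keys processed.
int CompactKeys(const FakeColumnFamilyData& cfd, int num_keys) {
  int processed = 0;
  for (int i = 0; i < num_keys; ++i) {
    if (cfd.IsDropped()) break;
    ++processed;
  }
  return processed;
}
```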
Closes https://github.com/facebook/rocksdb/pull/3250

Differential Revision: D6535991

Pulled By: yiwu-arbug

fbshipit-source-id: 5571df020beae7fa7db6fff5ad0d598f49962895
2017-12-11 13:57:48 -08:00
Zhongyi Xie
fcc8a6574d Make Universal compaction options dynamic
Summary:
Let me know if more test coverage is needed
Closes https://github.com/facebook/rocksdb/pull/3213

Differential Revision: D6457165

Pulled By: miasantreble

fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9
2017-12-11 13:27:06 -08:00
Prashant D
6a183d1ae8 Fix coverity issues compaction_job, compaction_picker
Summary:
db/compaction_job.cc:
  ReportStartedCompaction(compaction);

CID 1419863 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member bottommost_level_ is not initialized in this constructor nor in any functions that it calls.

db/compaction_picker_universal.cc:
7struct InputFileInfo {
   	2. uninit_member: Non-static class member level is not initialized in this constructor nor in any functions that it calls.

CID 1405355 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member index is not initialized in this constructor nor in any functions that it calls.
 38  InputFileInfo() : f(nullptr) {}

db/dbformat.h:
 ParsedInternalKey()
 84      : sequence(kMaxSequenceNumber)  // Make code analyzer happy

CID 1168095 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member type is not initialized in this constructor nor in any functions that it calls.
 85  {}  // Intentionally left uninitialized (for speed)
Closes https://github.com/facebook/rocksdb/pull/3091

Differential Revision: D6534558

Pulled By: yiwu-arbug

fbshipit-source-id: 5ada975956196d267b3f149386842af71eda7553
2017-12-11 11:57:15 -08:00
Prashant D
34aa245dd8 Fix coverity issues version, write_batch
Summary:
db/version_builder.cc:
117        base_vstorage_->InternalComparator();

CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.

db/version_edit.h:
145  FdWithKeyRange()
146      : fd(),
147        smallest_key(),
148        largest_key() {

CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
149  }

db/version_set.cc:
120    }

CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
121  }

db/write_batch.cc:
 939    assert(cf_mems_);

CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
 940  }
Closes https://github.com/facebook/rocksdb/pull/3092

Differential Revision: D6505666

Pulled By: yiwu-arbug

fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
2017-12-07 11:57:36 -08:00
Andrew Kryczka
2e3a00987e fix ASAN for DeleteFilesInRange test case
Summary:
error message was

```
==3095==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd18216c40 at pc 0x0000005edda1 bp 0x7ffd18215550 sp 0x7ffd18214d00
...
Address 0x7ffd18216c40 is located in stack of thread T0 at offset 1952 in frame
     #0 internal_repo_rocksdb/db_compaction_test.cc:1520 rocksdb::DBCompactionTest_DeleteFileRangeFileEndpointsOverlapBug_Test::TestBody()
```

It was unsafe to have slices referring to the temporary string objects' buffers, as those strings were destroyed before the slices were used. Fixed it by assigning the strings returned by `Key()` to local variables.
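
The same bug and fix in miniature. The `Slice` struct and `Key()` helper below are simplified, hypothetical stand-ins written for this sketch, not the RocksDB types:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical minimal non-owning view over a string's buffer.
struct Slice {
  const char* data;
  size_t size;
  Slice(const std::string& s) : data(s.data()), size(s.size()) {}
  std::string ToString() const { return std::string(data, size); }
};

// Hypothetical helper that returns a key by value.
std::string Key(int i) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "key%06d", i);
  return std::string(buf);
}

// Unsafe: the temporary returned by Key() is destroyed at the end of the
// full expression, so the slice would point at freed memory:
//   Slice begin = Key(1);
//
// Safe: keep the strings in named locals that outlive the slices.
std::string RangeAsString(int first, int last) {
  std::string begin_str = Key(first);
  std::string end_str = Key(last);
  Slice begin(begin_str);
  Slice end(end_str);
  return begin.ToString() + ".." + end.ToString();
}
```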
Closes https://github.com/facebook/rocksdb/pull/3238

Differential Revision: D6507864

Pulled By: ajkr

fbshipit-source-id: dd07de1a0070c6748c1ab4f3d7bd31f9a81889d0
2017-12-07 11:12:43 -08:00
Sagar Vemuri
bbef8c3884 Log GetCurrentTime failures during Flush and Compaction
Summary:
`GetCurrentTime()` is used to populate `creation_time` table property during flushes and compactions. It is safe to ignore `GetCurrentTime()` failures here but they should be logged.

(Note that `creation_time` property was introduced as part of TTL-based FIFO compaction in #2480.)

Test Plan:
`make check`
Closes https://github.com/facebook/rocksdb/pull/3231

Differential Revision: D6501935

Pulled By: sagar0

fbshipit-source-id: 376adcf4ab801d3a43ec4453894b9a10909c8eb6
2017-12-06 20:56:53 -08:00
Andrew Kryczka
78d1a5ec72 Preserve overlapping file endpoint invariant
Summary:
Fix for #2833.

- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843

Differential Revision: D5772387

Pulled By: ajkr

fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
2017-12-06 18:56:54 -08:00
Yi Wu
a7d32776f0 Fix write_callback_test compile error
Summary:
Rename shadow variable name db_impl.

Fixing #3227
Closes https://github.com/facebook/rocksdb/pull/3235

Differential Revision: D6504051

Pulled By: yiwu-arbug

fbshipit-source-id: 186c9378dabb11f8d6db56f45c95cc3b029fcb88
2017-12-06 17:12:27 -08:00
Yi Wu
20995c5729 Make iterator invalid on Merge error
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226

Differential Revision: D6499094

Pulled By: yiwu-arbug

fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c
2017-12-06 11:56:39 -08:00
Andrew Kryczka
63f1c0a57d fix gflags namespace
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
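
The fallback amounts to a few preprocessor lines. In this sketch the `google` namespace and `ParseCommandLineFlagsStub` are stand-ins for an old gflags release so that the snippet compiles on its own; a real build would include the gflags header instead:

```cpp
#include <cassert>

// Stand-in for an old gflags release: it lives in namespace `google` and
// does not define GFLAGS_NAMESPACE.
namespace google {
inline bool ParseCommandLineFlagsStub() { return true; }
}  // namespace google

// Newer gflags headers define GFLAGS_NAMESPACE themselves. If it is missing,
// assume the old fixed namespace.
#ifndef GFLAGS_NAMESPACE
#define GFLAGS_NAMESPACE google
#endif

// Code below can use the macro without knowing which gflags version is present.
bool ParseFlags() { return GFLAGS_NAMESPACE::ParseCommandLineFlagsStub(); }
```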
Closes https://github.com/facebook/rocksdb/pull/3212

Differential Revision: D6456973

Pulled By: ajkr

fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
2017-12-01 10:42:05 -08:00
Maysam Yabandeh
18dcf7f98d WritePrepared Txn: PreReleaseCallback
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default), the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205

Differential Revision: D6438959

Pulled By: maysamyabandeh

fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
2017-11-30 23:50:45 -08:00
zhangjinpeng1987
ffacaaa3ea fix Seek with lower_bound
Summary:
When `Seek` is called with a key less than `lower_bound`, it should seek to `lower_bound` instead.
ajkr PTAL
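
The intended behavior, sketched on top of `std::map` rather than a RocksDB iterator. `SeekWithLowerBound` is a hypothetical helper, and an empty string is used here to mean "no such key":

```cpp
#include <cassert>
#include <map>
#include <string>

// Returns the first key >= max(target, lower_bound), or "" if there is none.
// Keys are assumed to be non-empty.
std::string SeekWithLowerBound(const std::map<std::string, std::string>& data,
                               const std::string& target,
                               const std::string& lower_bound) {
  // Clamp the seek target so the result never falls below the bound.
  const std::string& effective = target < lower_bound ? lower_bound : target;
  std::map<std::string, std::string>::const_iterator it =
      data.lower_bound(effective);
  return it == data.end() ? std::string() : it->first;
}
```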
Closes https://github.com/facebook/rocksdb/pull/3199

Differential Revision: D6421126

Pulled By: ajkr

fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7
2017-11-29 22:56:29 -08:00
kapitan-k
75d57a5d53 C API: Add some block based table options
Summary: Closes https://github.com/facebook/rocksdb/pull/3159

Differential Revision: D6428220

Pulled By: sagar0

fbshipit-source-id: 60508d09b5281f54b907a1c40e9631fc08343131
2017-11-28 14:12:44 -08:00
Yi Wu
3cf562be31 Fix IOError on WAL write doesn't propagate to write group follower
Summary:
This is a simpler version of #3097 by removing all unrelated changes.

Fixing the bug where concurrent writes may get Status::OK while the WAL write actually hit an IOError. This happens when multiple writes form a write batch group, and the leader gets an IOError while writing to the WAL. The leader failed to pass the error to the followers in the group, and the followers ended up returning Status::OK() while actually writing nothing. The bug only affects writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
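
A heavily simplified, single-threaded sketch of the invariant being restored: whatever status the leader gets from the WAL write must be handed to every writer in the group. `Status`, `Writer`, and `CompleteWriteGroup` are hypothetical and do not mirror the real write-thread code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a status type.
struct Status {
  bool ok;
  std::string msg;
};

// One pending write in a batch group. Index 0 plays the leader.
struct Writer {
  Status status;
  bool done;
};

// Called by the leader after it has attempted the WAL write for the group.
void CompleteWriteGroup(std::vector<Writer>* group, const Status& wal_status) {
  for (size_t i = 0; i < group->size(); ++i) {
    // Followers must observe the leader's result, including errors.
    (*group)[i].status = wal_status;
    (*group)[i].done = true;
  }
}
```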
Closes https://github.com/facebook/rocksdb/pull/3201

Differential Revision: D6421644

Pulled By: yiwu-arbug

fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
2017-11-28 11:42:48 -08:00
Andrew Kryczka
1bdb44de95 optimize file ingestion checks for range deletion overlap
Summary:
Previously we checked every file in the level, which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179

Differential Revision: D6358125

Pulled By: ajkr

fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
2017-11-28 11:27:02 -08:00
Griffin Smith
2f09524762 Expose all remaining read and write options via the C API
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185

Differential Revision: D6389658

Pulled By: sagar0

fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
2017-11-28 10:28:46 -08:00
Maysam Yabandeh
e59cb2a19b Add seq_per_batch to WriteWithCallbackTest
Summary:
Augment WriteWithCallbackTest to also test when seq_per_batch is true.
Closes https://github.com/facebook/rocksdb/pull/3195

Differential Revision: D6398143

Pulled By: maysamyabandeh

fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3
2017-11-22 13:56:44 -08:00
Zhongyi Xie
5fac4729cc make compaction_readahead_size_ thread safe
Summary:
this should fix the failing tsan_check
Closes https://github.com/facebook/rocksdb/pull/3192

Differential Revision: D6390004

Pulled By: miasantreble

fbshipit-source-id: 6cadfc6f68febb1a77b0abcdb5416570dad926a5
2017-11-21 20:11:38 -08:00
anand1976
d394a6bb48 Add a ticker stat for number of keys skipped during iteration
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.

Also, fix the issue where, when StatisticsData is naturally aligned on a cacheline
boundary, the padding becomes a zero-size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
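
One way to get this behavior without a preprocessor conditional is a padding formula that never evaluates to zero. The sketch below assumes a 64-byte cache line; `PaddingFor` and `PaddedCounters` are hypothetical names, not the actual statistics code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kCacheLineSize = 64;  // assumed cache line size

// Bytes of padding that round `bytes` up to a cache line multiple. The result
// is never 0: if `bytes` is already aligned, a full extra line is returned,
// so the padding array below always has a non-zero size.
constexpr size_t PaddingFor(size_t bytes) {
  return kCacheLineSize - bytes % kCacheLineSize;
}

template <size_t N>
struct PaddedCounters {
  uint64_t counters[N];
  char padding[PaddingFor(N * sizeof(uint64_t))];
};
```

With three counters (24 bytes) the padding is 40 bytes, and with eight counters (already 64 bytes) it is a whole extra cache line instead of a zero-size array.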
Closes https://github.com/facebook/rocksdb/pull/3177

Differential Revision: D6353897

Pulled By: anand1976

fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
2017-11-20 21:26:37 -08:00
Gustav Davidsson
2d04ed65e4 Make trash-to-DB size ratio limit configurable
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158

Differential Revision: D6304897

Pulled By: gdavidsson

fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
2017-11-17 11:58:17 -08:00
Zhongyi Xie
32e31d49d1 Make DBOption compaction_readahead_size dynamic
Summary: Closes https://github.com/facebook/rocksdb/pull/3004

Differential Revision: D6056141

Pulled By: miasantreble

fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f
2017-11-16 17:57:25 -08:00
Maysam Yabandeh
54b43563be WritePrepared Txn: Refactoring WriteCallback
Summary:
Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number, and to make sure it is consistent across the code.
Closes https://github.com/facebook/rocksdb/pull/3168

Differential Revision: D6324312

Pulled By: maysamyabandeh

fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33
2017-11-15 08:27:06 -08:00
Maysam Yabandeh
53863b76f9 WritePrepared Txn: fix bug with Rollback seq
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157

Differential Revision: D6304291

Pulled By: maysamyabandeh

fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
2017-11-15 08:27:06 -08:00
Maysam Yabandeh
175d5d6a9e Properly destruct rebuilding_trx_
Summary:
When testing, rebuilding_trx_ in MemTableInserter might still be set when the test finishes, which would cause ASAN alarms for leaks. This patch deletes the pointers in the MemTableInserter destructor.
Closes https://github.com/facebook/rocksdb/pull/3162

Differential Revision: D6317113

Pulled By: maysamyabandeh

fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21
2017-11-14 08:56:50 -08:00
Maysam Yabandeh
2515266725 WritePrepared Txn: Refactoring TrackKeys
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140

Differential Revision: D6290258

Pulled By: maysamyabandeh

fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
2017-11-11 13:14:20 -08:00
Maysam Yabandeh
2edc92bc28 WritePrepared Txn: cross-compatibility test
Summary:
Add tests to ensure that the WritePrepared and WriteCommitted policies are cross-compatible when the db WAL is empty. This is important when the admin wants to switch between the policies. In that case, before the switch the admin needs to empty the WAL by i) committing or rolling back all pending transactions, and ii) flushing the memtables (FlushMemTables).
Closes https://github.com/facebook/rocksdb/pull/3118

Differential Revision: D6227247

Pulled By: maysamyabandeh

fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
2017-11-11 11:28:37 -08:00
Maysam Yabandeh
857adf388f WritePrepared Txn: Refactor conf params
Summary:
Summary of changes:
- Move seq_per_batch out of Options
- Rename concurrent_prepare to two_write_queues
- Add allocate_seq_only_for_data_
Closes https://github.com/facebook/rocksdb/pull/3136

Differential Revision: D6304458

Pulled By: maysamyabandeh

fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c
2017-11-10 17:28:12 -08:00
Dmitri Smirnov
f8e2db0717 Fix crashes, address test issues and adjust windows test script
Summary:
Add per-exe execution capability
  Add fix parsing of groups/tests
  Add timer test exclusion

 Fix unit tests
  Ifdef threadpool specific tests that do not pass on Vista threadpool.
  Remove spurious output from prefix_test so test case listing works
  properly.
  Fix not using standard test directories results in file creation errors
  in sst_dump_test.

  BlobDb fixes:
    In C++ end() iterators can not be dereferenced. They are not valid.
	When deleting blob_db_ set it to nullptr before any other code executes.
	Not fixed: On Windows you can not delete a file while it is open.
	[ RUN      ] BlobDBTest.ReadWhileGC
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied

  write_batch
    Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152

Differential Revision: D6293274

Pulled By: sagar0

fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
2017-11-10 10:41:57 -08:00
Shaohua Li
eefd75a228 Stream
Summary:
Add a simple policy for the NVMe write lifetime hint
Closes https://github.com/facebook/rocksdb/pull/3095

Differential Revision: D6298030

Pulled By: shligit

fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
2017-11-10 09:26:24 -08:00
kapitan-k
f1c5eaba56 updated c ingestexternalfileoptions for ingest behind
Summary: Closes https://github.com/facebook/rocksdb/pull/3151

Differential Revision: D6293861

Pulled By: ajkr

fbshipit-source-id: f8db0a71509d1cd8237f2d377bf9e1bb0464bdbf
2017-11-09 18:15:09 -08:00
Andrew Kryczka
93f69cb93a use bottommost compression when base level is bottommost
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:

- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten with the `compression_per_level` type, since bottommost compression didn't apply to the base level

I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141
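
The selection order described above can be sketched as follows. This is a minimal illustration, not the actual RocksDB code: `CompressionSketch`, `CompressionOptionsSketch`, `PickCompression`, and the `kUnset` sentinel are hypothetical names, and only `compression_per_level` and `bottommost_compression` come from the summary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real compression settings.
enum class CompressionSketch { kNone, kSnappy, kZSTD, kUnset };

struct CompressionOptionsSketch {
  std::vector<CompressionSketch> compression_per_level;  // indexed by level
  CompressionSketch bottommost_compression = CompressionSketch::kUnset;
};

// Picks the compression type for files a compaction writes to `level`.
// `is_bottommost` is true when no deeper level holds data.
CompressionSketch PickCompression(const CompressionOptionsSketch& opts,
                                  int level, bool is_bottommost) {
  assert(level >= 0 && !opts.compression_per_level.empty());
  // Check the bottommost setting first, so it also wins when the base level
  // happens to be the bottommost level.
  if (is_bottommost &&
      opts.bottommost_compression != CompressionSketch::kUnset) {
    return opts.bottommost_compression;
  }
  // Otherwise fall back to the per-level setting, clamped to the last entry.
  const std::size_t idx =
      std::min(static_cast<std::size_t>(level),
               opts.compression_per_level.size() - 1);
  return opts.compression_per_level[idx];
}
```

With this ordering, a file rewritten in a level that is both base and bottommost keeps the `bottommost_compression` type instead of reverting to the per-level one.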

Differential Revision: D6264614

Pulled By: ajkr

fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
2017-11-09 17:42:00 -08:00
Sagar Vemuri
a6d8e30c05 Remove unnecessary status check in TableCache::NewIterator
Summary:
While investigating the usage of the `new_table_iterator_nanos` perf counter, I saw some code was wrapped in an unnecessary status check, so I removed it.
Closes https://github.com/facebook/rocksdb/pull/3120

Differential Revision: D6229181

Pulled By: sagar0

fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34
2017-11-03 14:42:08 -07:00
Andrew Kryczka
24ad430600 pass key/value samples through zstd compression dictionary generator
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.

Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057
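
The branch on `zstd_max_train_bytes` can be sketched as below. This is only an illustration under assumptions: `DictTrainer` and `MakeDictionary` are hypothetical names, and the trainer is injected as a callback so the sketch stays independent of the zstd library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical trainer signature; in the change described above this role is
// played by zstd's dictionary generator.
using DictTrainer = std::function<std::string(const std::string& samples)>;

// Returns the dictionary to use for compression.
std::string MakeDictionary(const std::string& samples,
                           std::size_t zstd_max_train_bytes,
                           const DictTrainer& train) {
  if (zstd_max_train_bytes == 0) {
    // Zero keeps the previous behavior: the raw samples are the dictionary.
    return samples;
  }
  // Nonzero: pass the samples through the dictionary generator first.
  assert(static_cast<bool>(train));
  return train(samples);
}
```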

Differential Revision: D6116891

Pulled By: ajkr

fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
2017-11-02 22:56:36 -07:00
Andrew Kryczka
c4c1f961e7 dynamically change current memtable size
Summary:
Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.

This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
Closes https://github.com/facebook/rocksdb/pull/3119
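
A rough sketch of the rule described above, assuming a hypothetical `MemTableSketch` type and `MaybeUpdateWriteBufferSize` helper (neither is the real RocksDB interface):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the state of the current memtable.
struct MemTableSketch {
  std::size_t write_buffer_size;
  bool prefix_bloom_enabled;
};

// Applies a new write_buffer_size to the current memtable when allowed.
// Returns true if the current memtable was changed.
bool MaybeUpdateWriteBufferSize(MemTableSketch& mem, std::size_t new_size) {
  // The prefix bloom filter has a fixed size, so growing the memtable under
  // it keeps the old behavior: only future memtables see the new size.
  if (mem.prefix_bloom_enabled && new_size > mem.write_buffer_size) {
    return false;
  }
  mem.write_buffer_size = new_size;
  return true;
}
```

Shrinking always takes effect immediately in this sketch, which matches the OOM-prevention use case in the summary.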

Differential Revision: D6228304

Pulled By: ajkr

fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
2017-11-02 22:28:10 -07:00
Yi Wu
62578d80c1 Blob DB: Add compaction filter to remove expired blob index entries
Summary:
After adding expiration to the blob index in #3066, we are now able to add a compaction filter to clean up expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090

Differential Revision: D6183812

Pulled By: yiwu-arbug

fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
2017-11-02 17:27:38 -07:00
Yi Wu
7bfa88037e Blob DB: fix snapshot handling
Summary:
Blob DB keeps a blob file if data in the file is visible to an active snapshot. Before this patch, it checked whether any active snapshot had a sequence number greater than the earliest sequence in the file. This is problematic because we take a snapshot on every read: if reads keep coming, old blob files are never cleaned up. Change it to check whether an active snapshot falls in the range [earliest_sequence, obsolete_sequence), where the obsolete sequence is:
1. if data is relocated to another file by garbage collection, the latest sequence at the time garbage collection finishes
2. otherwise, the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087
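
The range check can be sketched as follows. The function name and the use of a `std::set` of active snapshot sequence numbers are assumptions for illustration; the real Blob DB code tracks snapshots differently.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using SequenceNumber = std::uint64_t;

// Returns true if some active snapshot falls in the half-open range
// [earliest_sequence, obsolete_sequence), meaning the file must be kept.
bool VisibleToActiveSnapshot(const std::set<SequenceNumber>& active_snapshots,
                             SequenceNumber earliest_sequence,
                             SequenceNumber obsolete_sequence) {
  assert(earliest_sequence <= obsolete_sequence);
  // First snapshot at or after the earliest sequence in the file.
  auto it = active_snapshots.lower_bound(earliest_sequence);
  return it != active_snapshots.end() && *it < obsolete_sequence;
}
```

For example, with snapshots at 5 and 50, a file covering [10, 20) is not pinned, whereas the old "any snapshot greater than the earliest sequence" test would have kept it because of the snapshot at 50.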

Differential Revision: D6182519

Pulled By: yiwu-arbug

fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
2017-11-02 15:58:27 -07:00
Andrew Kryczka
6778690b51 fix duplicate definition of GetEntryType()
Summary:
It's also defined in db/dbformat.cc per 7fe3b32896
Closes https://github.com/facebook/rocksdb/pull/3111

Differential Revision: D6219140

Pulled By: ajkr

fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35
2017-11-01 22:56:17 -07:00
Maysam Yabandeh
02693f64fc WritePrepared Txn: ValidateSnapshot
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101

Differential Revision: D6199405

Pulled By: maysamyabandeh

fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
2017-11-01 19:11:09 -07:00
Mikhail Antonov
7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add support to RocksDB for differential (incremental) snapshots: a snapshot of the DB changes between two points in time (one can think of it as the diff between two sequence numbers, or the diff D, which can be thought of as an SST file or just a set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.

From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".

This is a very early draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot, it wouldn't get dropped by a compaction. This is done by adding a new param to Options (preserve deletes flag) and a new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum, up to which deletes may be dropped; I assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp <--> seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation, and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
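
The tombstone rule from the list above can be sketched as a single predicate. `CanDropTombstone` and its parameters are hypothetical names, and the strict `>` comparison follows the wording of this message, not a check against the actual code.

```cpp
#include <cassert>
#include <cstdint>

using SequenceNumber = std::uint64_t;

// Decides whether compaction may drop a delete marker.
// `eligible_for_drop` is the decision compaction would make without the
// preserve-deletes feature.
bool CanDropTombstone(bool eligible_for_drop, bool preserve_deletes,
                      SequenceNumber tombstone_seq,
                      SequenceNumber cutoff_seqnum) {
  // Tombstones newer than the cutoff are kept so that the next incremental
  // snapshot can still see them.
  if (preserve_deletes && tombstone_seq > cutoff_seqnum) {
    return false;
  }
  return eligible_for_drop;
}
```

Advancing the cutoff from the client side then gradually releases older tombstones to normal compaction.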

Iterator changes:
 - A couple of params were added to ReadOptions to optionally allow the client to request internal keys instead of user keys (so that the client can get the latest value of a key, be it a delete marker or a put), as well as a min timestamp and min seqnum.

TableCache changes:
 - I modified the table_cache code to quickly exclude SST files from the iterator's heap if creation_time on the file is less than iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compaction), but not so much for universal compaction with a more or less long iterator time span.
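
A minimal sketch of that file-level filter, assuming a hypothetical `SstFileSketch` record and `FilesForIterator` helper (only `creation_time` and `iter_start_ts` are names taken from the text):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for per-file metadata.
struct SstFileSketch {
  std::string name;
  std::uint64_t creation_time;
};

// Keeps only the files an iterator with the given start timestamp must read.
std::vector<SstFileSketch> FilesForIterator(
    const std::vector<SstFileSketch>& files, std::uint64_t iter_start_ts) {
  std::vector<SstFileSketch> kept;
  // A file created before iter_start_ts is skipped entirely.
  std::copy_if(files.begin(), files.end(), std::back_inserter(kept),
               [iter_start_ts](const SstFileSketch& f) {
                 return f.creation_time >= iter_start_ts;
               });
  return kept;
}
```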

What's left:

 - Still looking at how best to plug that into the DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and the iter->key() call generally returns the user key. Can we add a new API to DBIter, such as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store the actual seqnum there, but I do need to store the type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Maysam Yabandeh
17731a43a6 WritePrepared Txn: Optimize for recoverable state
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery, so it does not need to be stored in the memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
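
The write path described above can be modeled roughly as below. `RecoverableStateSketch` and its methods are hypothetical, batches are modeled as plain strings, and keeping every buffered batch (instead of, say, only the latest one) is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a DB whose commit-time batches go only to the WAL.
class RecoverableStateSketch {
 public:
  // Commit path: write the batch to the WAL and buffer it, but skip the
  // memtable.
  void WriteCommitTimeBatch(const std::string& batch) {
    wal_.push_back(batch);
    buffered_.push_back(batch);
  }

  // Called right before a memtable flush: apply the buffered batches so the
  // data survives even if the WAL is deleted after the flush.
  void PrepareForFlush() {
    memtable_.insert(memtable_.end(), buffered_.begin(), buffered_.end());
    buffered_.clear();
  }

  // Recovery: replay the WAL into a fresh memtable.
  void Recover() {
    memtable_ = wal_;
    buffered_.clear();
  }

  const std::vector<std::string>& memtable() const { return memtable_; }

 private:
  std::vector<std::string> wal_;
  std::vector<std::string> memtable_;
  std::vector<std::string> buffered_;
};
```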

Differential Revision: D6148023

Pulled By: maysamyabandeh

fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
2017-11-01 17:26:46 -07:00
Shaohua Li
33c7d4ccd9 Make writable_file_max_buffer_size dynamic
Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681
2017-10-31 13:56:35 -07:00
Andrew Kryczka
b7bc9cc038 fix tracking oldest snapshot for bottom-level compaction
Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080

Differential Revision: D6185257

Pulled By: ajkr

fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd
2017-10-30 00:55:58 -07:00