Commit Graph

2956 Commits

Author SHA1 Message Date
Sagar Vemuri
93c2b91740 Introduce conditional merge-operator invocation in point lookups
Summary:
For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility.

This has a couple of advantages:
1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation.
2. It allows to peek at a merge key-value to see if further merge operands need to look at.

Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator.

Added a new unit test.
Made sure that there is no perf regression by running benchmarks.

Command line to Load data:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000
...
mergerandom  :      12.861 micros/op 77757 ops/sec;    8.6 MB/s ( updates:10000000)
```

**ReadRandomMergeRandom bechmark results:**
Command line:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000
```

Base -- Without this code change (on commit fc7476b):
```
readrandommergerandom :      38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```

With this code change:
```
readrandommergerandom :      38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
Closes https://github.com/facebook/rocksdb/pull/2923

Differential Revision: D5898239

Pulled By: sagar0

fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef
2017-09-28 15:58:49 -07:00
Quinn Jarrell
6a541afcc4 Make bytes_per_sync and wal_bytes_per_sync mutable
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed

Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893

Reviewed By: yiwu-arbug

Differential Revision: D5845814

Pulled By: TheRushingWookie

fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
2017-09-27 17:49:45 -07:00
Maysam Yabandeh
aa67bae6cf Break down PinnedDataIteratorRandomized
Summary:
Its timing out under tsan.
Closes https://github.com/facebook/rocksdb/pull/2928

Differential Revision: D5911766

Pulled By: maysamyabandeh

fbshipit-source-id: 2faacc07752ac8713a3a2abb5a4c4b7ae3bdf208
2017-09-26 14:27:30 -07:00
Zhongyi Xie
1d6700f9e6 Add test kPointInTimeRecoveryCFConsistency
Summary:
Context/problem:

- CFs may be flushed at different times
- A WAL can only be deleted after all CFs have flushed beyond end of that WAL.
- Point-in-time recovery might stop upon reaching the first corruption.
- Some CFs may have already flushed beyond that point, while others haven't. We should fail the Open() instead of proceeding with inconsistent CFs.
Closes https://github.com/facebook/rocksdb/pull/2900

Differential Revision: D5863281

Pulled By: miasantreble

fbshipit-source-id: 180dbaf83d96c804cff49b3c406312a4ae61313e
2017-09-22 17:26:36 -07:00
Andrew Kryczka
4708a6875c Repair DBs with trailing slash in name
Summary:
Problem:

- `DB::SanitizeOptions` strips trailing slash from `wal_dir` but not `dbname`
- We check whether `wal_dir` and `dbname` refer to the same directory using string equality: https://github.com/facebook/rocksdb/blob/master/db/repair.cc#L258
- Providing `dbname` with trailing slash causes default `wal_dir` to be misidentified as a separate directory.
- Then the repair tries to add all SST files to the `VersionEdit` twice (once for `dbname` dir, once for `wal_dir`) and fails with coredump.

Solution:

- Add a new `Env` function, `AreFilesSame`, which uses device and inode number to check whether files are the same. It's currently only implemented in `PosixEnv`.
- Migrate repair to use `AreFilesSame` to check whether `dbname` and `wal_dir` are same. If unsupported, falls back to string comparison.
Closes https://github.com/facebook/rocksdb/pull/2827

Differential Revision: D5761349

Pulled By: ajkr

fbshipit-source-id: c839d548678b742af1166d60b09abd94e5476238
2017-09-22 12:42:22 -07:00
Andrew Kryczka
fc7476bec1 fix populating range deletions in forward iterator
Summary:
fixes #2902
Closes https://github.com/facebook/rocksdb/pull/2917

Differential Revision: D5887175

Pulled By: ajkr

fbshipit-source-id: 364e292c636a3238bfc53b0fb9a01ff2f82dcbb9
2017-09-21 17:56:38 -07:00
PhaniShekhar
65a9cd6168 Use L1 size as estimate for L0 size in LevelCompactionBuilder::GetPathID
Summary:
Fix for [2461](https://github.com/facebook/rocksdb/issues/2461).

Problem: When using multiple db_paths setting with RocksDB, RocksDB incorrectly calculates the size of L1 in LevelCompactionBuilder::GetPathId.

max_bytes_for_level_base is used as L0 size and L1 size is calculated as (L0 size * max_bytes_for_level_multiplier). However, L1 size should be max_bytes_for_level_base.

Solution: Use max_bytes_for_level_base as L1 size. Also, use L1 size as the estimated size of L0.
Closes https://github.com/facebook/rocksdb/pull/2903

Differential Revision: D5885442

Pulled By: maysamyabandeh

fbshipit-source-id: 036da1c9298d173b9b80479cc6661ee4b7a951f6
2017-09-21 15:57:58 -07:00
Yi Wu
b4596c6174 Fix Get does not return super version on error
Summary:
This is caught when I was testing #2886.
Closes https://github.com/facebook/rocksdb/pull/2907

Differential Revision: D5863153

Pulled By: yiwu-arbug

fbshipit-source-id: 8c54759ba1a0dc101f24ab50423e35731300612d
2017-09-19 12:01:09 -07:00
Maysam Yabandeh
60beefd6e0 WritePrepared Txn: Advance seq one per batch
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.

Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885

Differential Revision: D5837843

Pulled By: maysamyabandeh

fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
2017-09-18 14:45:08 -07:00
Maysam Yabandeh
c57050b770 Use the default copy constructor in Options
Summary:
Our current implementation of (semi-)copy constructor of DBOptions and ColumnFamilyOptions seems to intend value by value copy, which is what the default copy constructor does anyway. Moreover not using the default constructor has the risk of forgetting to add newly added options.

As an example, allow_2pc seems to be forgotten in the copy constructor which was causing one of the unit tests not seeing its effect.
Closes https://github.com/facebook/rocksdb/pull/2888

Differential Revision: D5846368

Pulled By: maysamyabandeh

fbshipit-source-id: 1ee92a2aeae93886754b7bc039c3411ea2458683
2017-09-15 17:15:10 -07:00
Yi Wu
6b3c71f6ed Fix DBImpl::NotifyOnCompactionCompleted data race
Summary:
Access of `cfd->current()` needs to hold db mutex. The data race is caught by TSAN but hard to reproduce: https://gist.github.com/yiwu-arbug/0fc6dc0de915297a1740aa9610be9373
Closes https://github.com/facebook/rocksdb/pull/2894

Differential Revision: D5843884

Pulled By: yiwu-arbug

fbshipit-source-id: 0a30a421bc96f51840821538ad6453dc0815a942
2017-09-15 11:56:31 -07:00
Siying Dong
edcbb36944 Three code-level optimization to Iterator::Next()
Summary:
Three small optimizations:
(1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree.
(2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads.
(3) Move the switching direction logic in MergingIterator::Next() to a separate function.

These three in total improves readseq performance by about 3% in my benchmark setting.
Closes https://github.com/facebook/rocksdb/pull/2880

Differential Revision: D5829252

Pulled By: siying

fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3
2017-09-14 17:57:31 -07:00
Siying Dong
885b1c682e Two small refactoring for better inlining
Summary:
Move uncommon code paths in RangeDelAggregator::ShouldDelete() and IterKey::EnlargeBufferIfNeeded() to a separate function, so that the inlined strcuture can be more optimized.

Optimize it because these places show up in CPU profiling, though minimum. The performance is really hard measure. I ran db_bench with readseq benchmark against in-memory DB many times. The variation is big, but it seems to show 1% improvements.
Closes https://github.com/facebook/rocksdb/pull/2877

Differential Revision: D5828123

Pulled By: siying

fbshipit-source-id: 41a49e229f91e9f8409f85cc6f0dc70e31334e4b
2017-09-14 15:41:49 -07:00
Oleksandr Anyshchenko
ffac68367f Added save points for transactions C API
Summary:
Added possibility to set save points in transactions and then rollback to them
Closes https://github.com/facebook/rocksdb/pull/2876

Differential Revision: D5825829

Pulled By: yiwu-arbug

fbshipit-source-id: 62168992340bbcddecdaea3baa2a678475d1429d
2017-09-14 14:18:59 -07:00
Yi Wu
a843df668b Fix use-after-free in c_tset
Summary:
Fix asan error introduce by #2823
Closes https://github.com/facebook/rocksdb/pull/2879

Differential Revision: D5828454

Pulled By: yiwu-arbug

fbshipit-source-id: 50777855667f4e7b634279a654c3bfa01a1ac729
2017-09-13 16:12:02 -07:00
Andrew Kryczka
464fb36de9 fix hanging after CompactFiles with L0 overlap
Summary:
Bug report: https://www.facebook.com/groups/rocksdb.dev/permalink/1389452781153232/

Non-empty `level0_compactions_in_progress_` was aborting `CompactFiles` after incrementing `bg_compaction_scheduled_`, and in that case we never decremented it. This blocked future compactions and prevented DB close as we wait for scheduled compactions to finish/abort during close.

I eliminated `CompactFiles`'s dependency on `level0_compactions_in_progress_`. Since it takes a contiguous span of L0 files -- through the last L0 file if any L1+ files are included -- it's fine to run in parallel with other compactions involving L0. We make the same assumption in intra-L0 compaction.
Closes https://github.com/facebook/rocksdb/pull/2849

Differential Revision: D5780440

Pulled By: ajkr

fbshipit-source-id: 15b15d3faf5a699aed4b82a58352d4a7bb23e027
2017-09-13 15:41:38 -07:00
Oleksandr Anyshchenko
72e4190918 Additions for OptimisticTransactionDB in C API
Summary:
Added some bindings for `OptimisticTransactionDB` in C API
Closes https://github.com/facebook/rocksdb/pull/2823

Differential Revision: D5820672

Pulled By: yiwu-arbug

fbshipit-source-id: 7efd17f619cc0741feddd2050b8fc856f9288350
2017-09-13 12:12:11 -07:00
Amy Xu
5785b1fcb8 Fix naming in InternalKey
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868

Differential Revision: D5804152

Pulled By: axxufb

fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
2017-09-12 17:17:42 -07:00
Maysam Yabandeh
2d30aaae47 Exclude incompatible options in test
Summary:
options.enable_pipelined_write and options.concurrent_prepare are incompatible and should not be set together.
Closes https://github.com/facebook/rocksdb/pull/2875

Differential Revision: D5818358

Pulled By: maysamyabandeh

fbshipit-source-id: dad862508f00817ab302f8b61729accf38315fb8
2017-09-12 14:58:46 -07:00
Archit Mishra
3c42807794 do not call merge when checking to see if key exists
Summary:
Changes:
* added check for value before merge is called on code path that should check if key exists
Closes https://github.com/facebook/rocksdb/pull/2814

Reviewed By: IslamAbdelRahman

Differential Revision: D5743966

Pulled By: armishra

fbshipit-source-id: 6ac4283bc510c8ca50827d87ef0ba631f2b33b18
2017-09-12 12:02:53 -07:00
Andrew Kryczka
025b85b4ac speedup DBTest.EncodeDecompressedBlockSizeTest
Summary:
it sometimes takes more than 10 minutes (i.e., times out) on our internal CI. mainly because bzip is super slow. so I reduced the amount of  work it tries to do.
Closes https://github.com/facebook/rocksdb/pull/2856

Differential Revision: D5795883

Pulled By: ajkr

fbshipit-source-id: e69f986ae60b44ecc26b6b024abd0f13bdf3a3c5
2017-09-12 11:26:47 -07:00
Siying Dong
64b6452e0c Make InternalKeyComparator final and directly use it in merging iterator
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.

I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860

Differential Revision: D5800461

Pulled By: siying

fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
2017-09-11 12:04:21 -07:00
Siying Dong
2dd22e5449 Make DBIter class final
Summary:
DBIter is referenced in ArenaWrappedDBIter, which is a simple wrapper. If DBIter is final, some virtual function call can be avoided. Some functions can even be inlined, like DBIter.value() to ArenaWrappedDBIter.value() and DBIter.key() to ArenaWrappedDBIter.key(). The performance gain is hard to measure. I just ran the memory-only benchmark for readseq and saw it didn't regress. There shouldn't be any harm doing it. Just give compiler more choices.
Closes https://github.com/facebook/rocksdb/pull/2859

Differential Revision: D5799888

Pulled By: siying

fbshipit-source-id: 829788f91310c40282dcfb7e412e6ef489931143
2017-09-11 12:04:21 -07:00
Huachao Huang
2a5915049e Fix missing BYTES_PER_WRITE for pipeline write
Summary: Closes https://github.com/facebook/rocksdb/pull/2862

Differential Revision: D5805638

Pulled By: yiwu-arbug

fbshipit-source-id: 72d38c74395690023a719f400daff01527645a17
2017-09-11 11:41:27 -07:00
Maysam Yabandeh
f46464d383 write-prepared txn: call IsInSnapshot
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850

Differential Revision: D5787375

Pulled By: maysamyabandeh

fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
2017-09-11 09:14:48 -07:00
Andrew Kryczka
3cd7ea2e8a rename stall-related internal stats
Summary:
Some of these names, like `MEMTABLE_COMPACTION`, did not mean anything. Tried to give them descriptive names.
Closes https://github.com/facebook/rocksdb/pull/2852

Differential Revision: D5782822

Pulled By: ajkr

fbshipit-source-id: f2695c4124af4073da4492d7135bae2411220f3a
2017-09-07 18:26:18 -07:00
Siying Dong
0e99323ac2 Fix CLANG Analyze
Summary:
clang analyze shows warnings after we upgrade the CLANG version. Fix them.
Closes https://github.com/facebook/rocksdb/pull/2839

Differential Revision: D5769060

Pulled By: siying

fbshipit-source-id: 3f8e4df715590d8984f6564b608fa08cfdfa5f14
2017-09-07 14:28:06 -07:00
Andrew Kryczka
10ddd59ba7 fix CompactFiles inclusion of older L0 files
Summary:
if we're moving any L0 files down, we need to include older L0 files since they may contain older versions of the keys being moved down.
Closes https://github.com/facebook/rocksdb/pull/2845

Differential Revision: D5773800

Pulled By: ajkr

fbshipit-source-id: 9f0770a8eaaeea4c87df2e7a2a1d65bf9d7f4f7e
2017-09-06 11:42:25 -07:00
Kamalalochana Subbaiah
e612e31740 Updated CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2716

Differential Revision: D5606836

Pulled By: siying

fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3
2017-08-31 14:16:30 -07:00
Andrew Kryczka
c10cf166fa Dump non-final ZSTD compression type support
Summary: Closes https://github.com/facebook/rocksdb/pull/2810

Differential Revision: D5739947

Pulled By: ajkr

fbshipit-source-id: 09f99718b6b083c2711dcf17f7b68c305f3fd261
2017-08-30 16:41:24 -07:00
Artem Danilov
8a6708f5f2 Extend property map with compaction stats
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794

Differential Revision: D5717676

Pulled By: Tema

fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
2017-08-30 15:26:55 -07:00
Huachao Huang
0980dc6c9a Fix wrong smallest key of delete range tombstones
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check https://github.com/facebook/rocksdb/issues/2752 for more details.
Closes https://github.com/facebook/rocksdb/pull/2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3
2017-08-29 18:41:35 -07:00
Andrew Kryczka
b767972313 avoid use-after-move error
Summary:
* db/range_del_aggregator.cc (AddTombstone): Avoid a potential
use-after-move bug. The original code would both use and move
`tombstone` in a context where the order of those operations is
not specified.  The fix is to perform the use on a new, preceding
statement.

Author: meyering
Closes https://github.com/facebook/rocksdb/pull/2796

Differential Revision: D5721163

Pulled By: ajkr

fbshipit-source-id: a1d328d6a77a17c6425e8069860a202e615e2f48
2017-08-29 12:11:56 -07:00
Maysam Yabandeh
fbfa3e7a43 WriteAtPrepare: Efficient read from snapshot list
Summary:
Divide the old snapshots to two lists: a few that fit into a cached array and the rest in a vector, which is expected to be empty in normal cases. The former is to optimize concurrent reads from snapshots without requiring locks. It is done by an array of std::atomic, from which std::memory_order_acquire reads are compiled to simple read instructions in most of the x86_64 architectures.
Closes https://github.com/facebook/rocksdb/pull/2758

Differential Revision: D5660504

Pulled By: maysamyabandeh

fbshipit-source-id: 524fcf9a8e7f90a92324536456912a99aaa6740c
2017-08-26 01:00:38 -07:00
Yi Wu
3c840d1a6d Allow DB reopen with reduced options.num_levels
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.

This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740

Differential Revision: D5629354

Pulled By: yiwu-arbug

fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
2017-08-24 16:10:54 -07:00
Yi Wu
92bfd6c507 Fix DropColumnFamily data race
Summary:
It should hold db mutex while accessing max_total_in_memory_state_.
Closes https://github.com/facebook/rocksdb/pull/2784

Differential Revision: D5696536

Pulled By: yiwu-arbug

fbshipit-source-id: 45430634d7fe11909b38e42e5f169f618681c4ee
2017-08-24 14:56:04 -07:00
Andrew Kryczka
7fbb9eccaf support disabling checksum in block-based table
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781

Differential Revision: D5694702

Pulled By: ajkr

fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
2017-08-23 19:40:47 -07:00
Andrew Kryczka
7eba54eb9b test compaction input-level split range tombstone assumption
Summary:
One of the core assumptions of DeleteRange is that files containing portions of the same range tombstone are treated as a single unit from the perspective of compaction picker. Need better tests for this. This PR adds the tests for manual compaction.
Closes https://github.com/facebook/rocksdb/pull/2769

Differential Revision: D5676677

Pulled By: ajkr

fbshipit-source-id: 1b4b3382b300ff7048b872911405fdf900e4fbec
2017-08-23 14:11:32 -07:00
mkosieradzki
a12479819d Improved transactions support in C API
Summary:
Solves #2632

Added OptimisticTransactionDB to the C API.
Added missing merge operations to Transaction.
Added missing get_for_update operation to transaction

If required I will create tests for this another day.
Closes https://github.com/facebook/rocksdb/pull/2633

Differential Revision: D5600906

Pulled By: yiwu-arbug

fbshipit-source-id: da23e4484433d8f59d471f778ff2ae210e3fe4eb
2017-08-23 12:40:28 -07:00
Andrew Kryczka
8ace1f79b5 add counter for deletion dropping optimization
Summary:
add this counter stat to track usage of deletion-dropping optimization. if usage is low, we can delete it to prevent bugs like #2726.
Closes https://github.com/facebook/rocksdb/pull/2761

Differential Revision: D5665421

Pulled By: ajkr

fbshipit-source-id: 881befa2d199838dac88709e7b376a43d304e3d4
2017-08-19 14:10:08 -07:00
Andrew Kryczka
ed0a4c93ef perf_context measure user bytes read
Summary:
With this PR, we can measure read-amp for queries where perf_context is enabled as follows:

```
SetPerfLevel(kEnableCount);
Get(1, "foo");
double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes);
SetPerfLevel(kDisable);
```

Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp.
Closes https://github.com/facebook/rocksdb/pull/2749

Differential Revision: D5647240

Pulled By: ajkr

fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12
2017-08-18 11:43:33 -07:00
Sagar Vemuri
9a44b4c32c Allow merge operator to be called even with a single operand
Summary:
Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overriden.

This is needed for Cassandra-on-RocksDB work. All Cassandra writes are through merges and this will allow a single merge-value to be updated in the merge-operator invoked via a compaction, if needed, due to an expired TTL.
Closes https://github.com/facebook/rocksdb/pull/2721

Differential Revision: D5608706

Pulled By: sagar0

fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
2017-08-16 23:42:00 -07:00
follitude
ac8fb77afd fix some misspellings
Summary:
PTAL ajkr
Closes https://github.com/facebook/rocksdb/pull/2750

Differential Revision: D5648052

Pulled By: ajkr

fbshipit-source-id: 7cd1ddd61364d5a55a10fdd293fa74b2bf89dd98
2017-08-16 21:57:20 -07:00
Andrew Kryczka
af012c0f83 fix deleterange with memtable prefix bloom
Summary:
the range delete tombstones in memtable should be added to the aggregator even when the memtable's prefix bloom filter tells us the lookup key's not there. This bug could cause data to temporarily reappear until the memtable containing range deletions is flushed.

Reported in #2743.
Closes https://github.com/facebook/rocksdb/pull/2745

Differential Revision: D5639007

Pulled By: ajkr

fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
2017-08-16 19:13:01 -07:00
Andrew Kryczka
1c8dbe2aa2 update scores after picking universal compaction
Summary:
We forgot to recompute compaction scores after picking a universal compaction like we do in level compaction (a34b2e388e/db/compaction_picker.cc (L691-L695)). This leads to a fairness issue where we waste compactions on CFs/DB instances that don't need it while others can starve.

Previously, ccecf3f4fb fixed the issue for the read-amp-based compaction case; this PR avoids the issue earlier and also for size-ratio-based compactions.
Closes https://github.com/facebook/rocksdb/pull/2688

Differential Revision: D5566191

Pulled By: ajkr

fbshipit-source-id: 010bccb2a107f6a76f3d3022b90aadce5cc48feb
2017-08-16 18:42:33 -07:00
Maysam Yabandeh
eb6425303e Update WritePrepared with the pseudo code
Summary:
Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped.

This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel.
Closes https://github.com/facebook/rocksdb/pull/2713

Differential Revision: D5640021

Pulled By: maysamyabandeh

fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
2017-08-16 16:57:47 -07:00
Siying Dong
71598cdc75 Fix false removal of tombstone issue in FIFO and kCompactionStyleNone
Summary:
Similar to the bug fixed by https://github.com/facebook/rocksdb/pull/2726, FIFO with compaction and kCompactionStyleNone during user customized CompactFiles() with output level to be 0 can suffer from the same problem. Fix it by leveraging the bottommost_level_ flag.
Closes https://github.com/facebook/rocksdb/pull/2735

Differential Revision: D5626906

Pulled By: siying

fbshipit-source-id: 2b148d0461c61dbd986d74655e384419ae442158
2017-08-15 13:02:19 -07:00
Andrew Kryczka
acf935e40f fix deletion dropping in intra-L0
Summary:
`KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
Closes https://github.com/facebook/rocksdb/pull/2726

Differential Revision: D5617286

Pulled By: ajkr

fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
2017-08-11 18:12:38 -07:00
yiwu-arbug
3f5888430a Fix c_test ASAN failure
Summary:
Fix c_test missing deletion of write batch pointer.
Closes https://github.com/facebook/rocksdb/pull/2725

Differential Revision: D5613866

Pulled By: yiwu-arbug

fbshipit-source-id: bf3f59a6812178577c9c25bae558ef36414a1f51
2017-08-11 13:00:15 -07:00
Andrew Kryczka
6f051e0c71 fix corruption_test valgrind
Summary: Closes https://github.com/facebook/rocksdb/pull/2724

Differential Revision: D5613416

Pulled By: ajkr

fbshipit-source-id: ed55fb66ab1b41dfdfe765fe3264a1c87a8acb00
2017-08-11 12:29:14 -07:00
Kent767
ac098a4626 expose set_skip_stats_update_on_db_open to C bindings
Summary:
It would be super helpful to not have to recompile rocksdb to get this performance tweak for mechanical disks.

I have signed the CLA.
Closes https://github.com/facebook/rocksdb/pull/2718

Differential Revision: D5606994

Pulled By: yiwu-arbug

fbshipit-source-id: c05e92bad0d03bd38211af1e1ced0d0d1e02f634
2017-08-11 12:16:45 -07:00
Siying Dong
666a005f9b Support prefetch last 512KB with direct I/O in block based file reader
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708

Differential Revision: D5593091

Pulled By: siying

fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
2017-08-11 12:16:45 -07:00
Stanislav Tkach
25df24254b Add column families related functions (C API)
Summary:
(#2564)
Closes https://github.com/facebook/rocksdb/pull/2669

Differential Revision: D5594151

Pulled By: yiwu-arbug

fbshipit-source-id: 67ae9446342f3323d6ecad8e811f4158da194270
2017-08-10 13:43:54 -07:00
Oleksandr Anyshchenko
0cecf8155b Write batch for TransactionDB in C API
Summary: Closes https://github.com/facebook/rocksdb/pull/2655

Differential Revision: D5600858

Pulled By: yiwu-arbug

fbshipit-source-id: cf52f9104e348438bf168dc6bf7af3837faf12ef
2017-08-10 11:58:53 -07:00
jimmyway
23c7d13540 fix comment
Summary:
Signed-off-by: tang.jin <tang.jin@istuary.com>
Closes https://github.com/facebook/rocksdb/pull/2644

Differential Revision: D5600861

Pulled By: yiwu-arbug

fbshipit-source-id: 9516636cb6e77b09fe0ebef78953adf4b7e88cc8
2017-08-09 22:57:01 -07:00
Aaron G
7848f0b24c add VerifyChecksum() to db.h
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498

Differential Revision: D5324269

Pulled By: lightmark

fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
2017-08-09 15:58:13 -07:00
Chang Liu
d97a72d63f Try to repair db with wal_dir option, avoid leak some WAL files
Summary:
We should search wal_dir in Repairer::FindFiles function, and avoid use
LogFileNmae(dbname, number) to get WAL file's name, which will get a wrong
WAL filename. as following:

```
[WARN] [/home/liuchang/Workspace/rocksdb/db/repair.cc:310] Log #3: ignoring conversion error: IO error: While opening a file for sequentially reading: /tmp/rocksdbtest-1000/repair_test/000003.log: No such file or directory
```
  I have added a new test case to repair_test.cc, which try to repair db with all WAL options.

Signed-off-by: Chang Liu <liuchang0812@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/2692

Differential Revision: D5575888

Pulled By: ajkr

fbshipit-source-id: 5b93e9f85cddc01663ccecd87631fa723ac466a3
2017-08-08 10:47:57 -07:00
Maysam Yabandeh
bdc056f8aa Refactor PessimisticTransaction
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691

Differential Revision: D5569464

Pulled By: maysamyabandeh

fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
2017-08-07 16:12:29 -07:00
Sagar Vemuri
20dc5e74f2 Optimize range-delete aggregator call in merge helper.
Summary:
In the condition:
```
if (range_del_agg != nullptr &&
    range_del_agg->ShouldDelete(
        iter->key(),
        RangeDelAggregator::RangePositioningMode::kForwardTraversal) &&
    filter != CompactionFilter::Decision::kRemoveAndSkipUntil) {
...
}
```
it could be possible that all the work done in `range_del_agg->ShouldDelete` is wasted due to not having the right `filter` value later on.
Instead, check `filter` value before even calling `range_del_agg->ShouldDelete`, which is a much more involved function.
Closes https://github.com/facebook/rocksdb/pull/2690

Differential Revision: D5568931

Pulled By: sagar0

fbshipit-source-id: 17512d52360425c7ae9de7675383f5d7bc3dad58
2017-08-05 00:15:35 -07:00
Andrew Kryczka
cc01985db0 Introduce bottom-pri thread pool for large universal compactions
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.

This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.

- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580

Differential Revision: D5422916

Pulled By: ajkr

fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
2017-08-03 15:43:29 -07:00
Maysam Yabandeh
58410aee44 Fix the overflow bug in AwaitState
Summary:
https://github.com/facebook/rocksdb/issues/2559 reports an overflow in AwaitState. nbronson has debugged the issue and presented the fix, which is applied to this patch. Moreover this patch adds more comments to clarify the logic in AwaitState.

I tried with both 16 and 64 threads on update benchmark. The fix lowers cpu usage by 1.6 but also lowers the throughput by 1.6 and 2% respectively. Apparently the bug had favored using the spinning more often.

Benchmarks:
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --benchmarks="fillrandom" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=64 --num=200000

Results
$ cat update-16t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 234117 ops/sec;   51.8 MB/sec
updaterandom [MEDIAN 3 runs] : 233581 ops/sec;   51.7 MB/sec
3896.42user 1539.12system 6:50.61elapsed 1323%CPU (0avgtext+0avgdata 331308maxresident)k
0inputs+0outputs (0major+1281001minor)pagefaults 0swaps
$ cat update-16t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 230364 ops/sec;   51.0 MB/sec
updaterandom [MEDIAN 3 runs] : 226169 ops/sec;   50.0 MB/sec
3865.46user 1568.32system 6:57.63elapsed 1301%CPU (0avgtext+0avgdata 315012maxresident)k
0inputs+0outputs (0major+1342568minor)pagefaults 0swaps

$ cat update-64t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 261878 ops/sec;   57.9 MB/sec
updaterandom [MEDIAN 3 runs] : 262859 ops/sec;   58.2 MB/sec
926.27user 578.06system 2:27.46elapsed 1020%CPU (0avgtext+0avgdata 475480maxresident)k
0inputs+0outputs (0major+1058728minor)pagefaults 0swaps
$ cat update-64t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 256699 ops/sec;   56.8 MB/sec
updaterandom [MEDIAN 3 runs] : 256380 ops/sec;   56.7 MB/sec
933.47user 575.37system 2:30.41elapsed 1003%CPU (0avgtext+0avgdata 482340maxresident)k
0inputs+0outputs (0major+1078557minor)pagefaults 0swaps
Closes https://github.com/facebook/rocksdb/pull/2679

Differential Revision: D5553732

Pulled By: maysamyabandeh

fbshipit-source-id: 98b72dc3a8e0f22ea29d4f7c7790af10c369c5bb
2017-08-03 10:43:28 -07:00
Maysam Yabandeh
c3d5c4d38a Refactor TransactionImpl
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676

Differential Revision: D5549998

Pulled By: maysamyabandeh

fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
2017-08-03 08:57:22 -07:00
奏之章
3218edc573 Fix universal compaction bug
Summary:
this value ``` Compaction::is_trivial_move_ ``` uninitialized .
under universal compaction , we enable ```  CompactionOptionsUniversal::allow_trivial_move  ``` ,
9b11d4345a/db/compaction.cc (L245)
here is a disastrous bug , some sst trivial move to target level without overlap check ...
THEN , DATABASE DAMAGED , WE GOT A LEVEL WITH OVERLAP !
Closes https://github.com/facebook/rocksdb/pull/2634

Differential Revision: D5530722

Pulled By: siying

fbshipit-source-id: 425ab55bca5967110377d634258360bcf88c200e
2017-07-31 14:27:45 -07:00
Andrew Kryczka
6a36b3a7b9 fix db get/write stats
Summary:
we were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with wrong data.
Closes https://github.com/facebook/rocksdb/pull/2666

Differential Revision: D5520384

Pulled By: ajkr

fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
2017-07-31 12:12:03 -07:00
Siying Dong
21696ba502 Replace dynamic_cast<>
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645

Differential Revision: D5502723

Pulled By: siying

fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
2017-07-28 16:27:16 -07:00
Sagar Vemuri
ac748c57ed Fix FIFO Compaction with TTL tests
Summary:
- FIFOCompactionWithTTLTest was flaky when run in parallel earlier, and hence it was disabled. Fixed it now.
- Also, faking sleep now instead of really sleeping to make tests more realistic by using TTLs like 1 hour and 1 day.
Closes https://github.com/facebook/rocksdb/pull/2650

Differential Revision: D5506038

Pulled By: sagar0

fbshipit-source-id: deb429a527f045e3e2c5138b547c3e8ac8586aa2
2017-07-28 14:42:59 -07:00
Andrew Kryczka
710411aea6 fix asan/valgrind for TableCache cleanup
Summary:
Breaking commit: d12691b86f

In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.

This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662

Differential Revision: D5515108

Pulled By: ajkr

fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
2017-07-27 20:28:04 -07:00
Siying Dong
fca4d6da17 Build fewer tests in Travis platform_dependent tests
Summary:
platform_dependent tests in Travis now builds all tests, which is not needed. Only build those tests we need to run.
Closes https://github.com/facebook/rocksdb/pull/2647

Differential Revision: D5513954

Pulled By: siying

fbshipit-source-id: 4d540b146124e70dd25586c47939d19f93655b0a
2017-07-27 17:29:01 -07:00
Aaron Gao
8f553d3c52 remove unnecessary internal_comparator param in newIterator
Summary:
solved https://github.com/facebook/rocksdb/issues/2604
Closes https://github.com/facebook/rocksdb/pull/2648

Differential Revision: D5504875

Pulled By: lightmark

fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee
2017-07-27 14:30:42 -07:00
Andrew Kryczka
d12691b86f move TableCache::EraseHandle outside of db mutex
Summary:
Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2654

Differential Revision: D5507126

Pulled By: ajkr

fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
2017-07-27 12:14:41 -07:00
Siying Dong
e7697b8ce8 Fix LITE unit tests
Summary: Closes https://github.com/facebook/rocksdb/pull/2649

Differential Revision: D5505778

Pulled By: siying

fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc
2017-07-26 21:11:47 -07:00
Siying Dong
c281b44829 Revert "CRC32 Power Optimization Changes"
Summary:
This reverts commit 2289d38115.
Closes https://github.com/facebook/rocksdb/pull/2652

Differential Revision: D5506163

Pulled By: siying

fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3
2017-07-26 19:31:36 -07:00
Sagar Vemuri
9980de262c Fix FIFO compaction picker test
Summary:
A FIFO compaction picker test is accidentally testing against an instance of level compaction picker.
Closes https://github.com/facebook/rocksdb/pull/2641

Differential Revision: D5495390

Pulled By: sagar0

fbshipit-source-id: 301962736f629b1c499570fb504cdbe66bacb46f
2017-07-26 12:12:26 -07:00
Kamalalochana Subbaiah
2289d38115 CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2353

Differential Revision: D5210948

Pulled By: siying

fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf
2017-07-26 09:42:29 -07:00
Maysam Yabandeh
30b58cf71a Remove the orphan assert on !need_log_sync
Summary:
We initially had disabled support for write_options.sync when concurrent_prepare_ is set. We later added this support but the statement that asserts this combination is not used was left there. This patch cleans it up.
Closes https://github.com/facebook/rocksdb/pull/2642

Differential Revision: D5496101

Pulled By: maysamyabandeh

fbshipit-source-id: becbc503446f2a51bee24cc861958c090c724ec2
2017-07-25 18:41:52 -07:00
Yi Wu
fe1a5559f3 Fix flaky write_callback_test
Summary:
The test is failing occasionally on the assert: `ASSERT_TRUE(writer->state == WriteThread::State::STATE_INIT)`. This is because the test don't make the leader wait for long enough before updating state for its followers. The patch move the update to `threads_waiting` to the end of `WriteThread::JoinBatchGroup:Wait` callback to avoid this happening.

Also adding `WriteThread::JoinBatchGroup:Start` and have each thread wait there while another thread is linking to the linked-list. This is to make the check of `is_leader` more deterministic.

Also changing two while-loops of `compare_exchange_strong` to plain `fetch_add`, to make it look cleaner.
Closes https://github.com/facebook/rocksdb/pull/2640

Differential Revision: D5491525

Pulled By: yiwu-arbug

fbshipit-source-id: 6e897f122082bd6f98e6d51b31a25e5fd0a3fb82
2017-07-25 16:42:11 -07:00
Islam AbdelRahman
ea8ad4f678 Fix compaction div by zero logging
Summary:
We will divide by zero if `stats.micros` is zero, just add a simple check
This happens sometimes during running tests and UBSAN complains
Closes https://github.com/facebook/rocksdb/pull/2631

Differential Revision: D5481455

Pulled By: IslamAbdelRahman

fbshipit-source-id: 69aa24e64e21de15d9e2b8009adf01675fcc6598
2017-07-24 11:58:02 -07:00
kapitan-k
34112aeffd Added db paths to c
Summary: Closes https://github.com/facebook/rocksdb/pull/2613

Differential Revision: D5476064

Pulled By: sagar0

fbshipit-source-id: 6b30a9eacb93a945bbe499eafb90565fa9f1798b
2017-07-24 11:58:02 -07:00
Daniel Black
1d8aa2961c Gcc 7 ParsedInternalKey replace memset with clear function.
Summary:
I haven't looked to see if a class variable inside a loop like this is always initialised.
Closes https://github.com/facebook/rocksdb/pull/2602

Differential Revision: D5475937

Pulled By: IslamAbdelRahman

fbshipit-source-id: 8570b308f9a4b49e2a56ccc9e9b84d7c46568c15
2017-07-24 11:31:15 -07:00
Siying Dong
e67b35c076 Add Iterator::Refresh()
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621

Differential Revision: D5464500

Pulled By: siying

fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
2017-07-24 10:54:37 -07:00
Andrew Kryczka
a34b2e388e Fix caching of compaction picker's next index
Summary:
The previous implementation of caching `file_size` index made no sense. It only remembered the original span of locked files starting from beginning of `file_size`. We should remember the index after all compactions that have been considered but rejected. This will reduce the work we do while holding the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2624

Differential Revision: D5468152

Pulled By: ajkr

fbshipit-source-id: ab92a4bffe76f9f174d861bb5812b974d1013400
2017-07-21 20:57:15 -07:00
Sagar Vemuri
72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao
1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Andrew Kryczka
a22b9cc6fe overlapping endpoint fixes in level compaction picker
Summary:
This diff addresses two problems. Both problems cause us to miss scheduling desirable compactions. One side effect is compaction picking can spam logs, as there's no delay after failed attempts to pick compactions.

1. If a compaction pulled in a locked input-level file due to user-key overlap, we would not consider picking another file from the same input level.
2. If a compaction pulled in a locked output-level file due to user-key overlap, we would not consider picking any other compaction on any level.

The code changes are dependent, which is why I solved both problems in a single diff.

- Moved input-level `ExpandInputsToCleanCut` into the loop inside `PickFileToCompact`. This gives two benefits: (1) if it fails, we will try the next-largest file on the same input level; (2) we get the fully-expanded input-level key-range with which we can check for pending compactions in output level.
- Added another call to `ExpandInputsToCleanCut` inside `PickFileToCompact`'s to check for compaction conflicts in output level.
- Deleted call to `IsRangeInCompaction` in `PickFileToCompact`, as `ExpandInputsToCleanCut` also correctly handles the case where original output-level files (i.e., ones not pulled in due to user-key overlap) are pending compaction.
Closes https://github.com/facebook/rocksdb/pull/2615

Differential Revision: D5454643

Pulled By: ajkr

fbshipit-source-id: ea3fb5477d83e97148951af3fd4558d2039e9872
2017-07-19 20:42:00 -07:00
Andrew Kryczka
ffd2a2eefd delete ExpandInputsToCleanCut failure log
Summary:
I decided not even to keep it as an INFO-level log as it is too normal for compactions to be skipped due to locked input files. Removing logging here makes us consistent with how we treat locked files that weren't pulled in due to overlap.

We may want some error handling on line 422, which should never happen when called by `LevelCompactionBuilder::PickCompaction`, as `SetupInitialFiles` skips compactions where overlap causes the output level to pull in locked files.
Closes https://github.com/facebook/rocksdb/pull/2617

Differential Revision: D5458502

Pulled By: ajkr

fbshipit-source-id: c2e5f867c0a77c1812ce4242ab3e085b3eee0bae
2017-07-19 20:42:00 -07:00
Maysam Yabandeh
36651d14ee Moving static AdaptationContext to outside function
Summary:
Moving static AdaptationContext to outside function to bypass tsan's false report with static initializers.

It is because with optimization enabled std::atomic is simplified to as a simple read with no locks. The existing lock produced by static initializer is __cxa_guard_acquire which is apparently not understood by tsan as it is different from normal locks (__gthrw_pthread_mutex_lock).

This is a known problem with tsan:
https://stackoverflow.com/questions/27464190/gccs-tsan-reports-a-data-race-with-a-thread-safe-static-local
https://stackoverflow.com/questions/42062557/c-multithreading-is-initialization-of-a-local-static-lambda-thread-safe

A workaround that I tried was to move the static variable outside the function. It is not a good coding practice since it gives global visibility to variable but it is a hackish workaround until g++ tsan is improved.
Closes https://github.com/facebook/rocksdb/pull/2598

Differential Revision: D5445281

Pulled By: yiwu-arbug

fbshipit-source-id: 6142bd934eb5852d8fd7ce027af593ba697ed41d
2017-07-18 16:58:22 -07:00
Siying Dong
ae28634e9f Remove some left-over BSD headers
Summary: Closes https://github.com/facebook/rocksdb/pull/2608

Differential Revision: D5444797

Pulled By: siying

fbshipit-source-id: 690581d03f37822e059a16085088e8e2d8a45016
2017-07-18 11:56:57 -07:00
Sushma Devendrappa
0655b58582 enable PinnableSlice for RowCache
Summary:
This patch enables using PinnableSlice for RowCache, changes include
not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle.
Closes https://github.com/facebook/rocksdb/pull/2492

Differential Revision: D5316216

Pulled By: maysamyabandeh

fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362
2017-07-17 15:08:30 -07:00
Yi Wu
00464a3140 Fix column_family_test with LITE build
Summary:
Fix column_family_test with LITE build. I need this patch to fix 5.6 branch.
Closes https://github.com/facebook/rocksdb/pull/2597

Differential Revision: D5437171

Pulled By: yiwu-arbug

fbshipit-source-id: 88b9dc5925a6b47af10c1b41bc5b07c4251a84b5
2017-07-17 15:08:24 -07:00
Yedidya Feldblum
f1a056e005 CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision:
D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
2017-07-16 21:26:02 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
奏之章
70440f7a63 Add virtual func IsDeleteRangeSupported
Summary:
this modify allows third-party tables able to support delete range
Closes https://github.com/facebook/rocksdb/pull/2035

Differential Revision: D5407973

Pulled By: ajkr

fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418
2017-07-12 16:58:45 -07:00
Sagar Vemuri
56656e12d6 Temporarily disable FIFOCompactionWithTTLTest
Summary:
FIFOCompactionWithTTLTests are flaky when run in parallel, as there is a time element involved to it. Temporarily disabling them while I investigate a more robust testing solution like, say,  mocking time.
Closes https://github.com/facebook/rocksdb/pull/2548

Differential Revision: D5386084

Pulled By: sagar0

fbshipit-source-id: 262886b25bdf091021d8553e780443a985e9bac4
2017-07-07 20:12:58 -07:00
Andrew Kryczka
33042573db Fix GetCurrentTime() initialization for valgrind
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526

Differential Revision: D5358689

Pulled By: ajkr

fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
2017-07-05 12:12:00 -07:00
Maysam Yabandeh
7604b463b5 Update the AddDBStats in LITE
Summary: Closes https://github.com/facebook/rocksdb/pull/2525

Differential Revision: D5356859

Pulled By: maysamyabandeh

fbshipit-source-id: f593adad2a8aab12dcd6ab25db076eca51d30d34
2017-06-30 10:56:50 -07:00
Maysam Yabandeh
1e34d07e18 Simplify and document sync rules for logs_ etc
Summary:
Adding/Correcting inline comments and clarify the sync rules. To make it simple to reason, the rules are a big general which ended up to some extra synchronizations. However such synchronizations are not on the fast path, and they are worth the simplicity.
Closes https://github.com/facebook/rocksdb/pull/2517

Differential Revision: D5348239

Pulled By: maysamyabandeh

fbshipit-source-id: ff2e59fb1e568c122d2cdbf598310f3613b7d212
2017-06-30 09:42:28 -07:00
Andrew Kryczka
d310e0f339 Regression test for empty dedicated range deletion file
Summary:
Issue: #2478
Fix: #2503

The bug happened when all of these conditions were satisfied:

- A subcompaction generates no keys
- `RangeDelAggregator::ShouldAddTombstones()` returns true because there's at least one non-obsoleted range deletion in its map
- None of the non-obsolete tombstones overlap with the subcompaction key-range

Under those conditions, we were creating a dedicated file for range deletions which was left empty, thus causing an error in VersionEdit.

I verified this test case fails before the #2503 fix and passes after.
Closes https://github.com/facebook/rocksdb/pull/2521

Differential Revision: D5352568

Pulled By: ajkr

fbshipit-source-id: f619cae39984ce9bb9b7a4e7a9ac0f2bb2ce43e9
2017-06-30 00:11:25 -07:00
Maysam Yabandeh
e9f91a5176 Add a fetch_add variation to AddDBStats
Summary:
AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add.

This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is:
- vanilla: 68618 tps
- removing mutex on AddDBStats (unsafe): 69767 tps
- fetch_add for all AddDBStats: 69200 tps
- fetch_add only for concurrently access AddDBStats (this patch): 69579 tps
Closes https://github.com/facebook/rocksdb/pull/2505

Differential Revision: D5330656

Pulled By: maysamyabandeh

fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
2017-06-29 17:12:19 -07:00
zhangjinpeng1987
c1b375e96a skip generating empty sst
Summary:
When a compaction job output nothing, there is no necessary to generate a empty sst file which will cause `VersionEdit::EncodeTo` failed.
ref https://github.com/facebook/rocksdb/issues/2478
Closes https://github.com/facebook/rocksdb/pull/2503

Differential Revision: D5350799

Pulled By: ajkr

fbshipit-source-id: df0b4fcf3507fe1c3c435208b762e75478e00143
2017-06-29 15:26:52 -07:00
Mike Kolupaev
397ab11152 Improve Status message for block checksum mismatches
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.

It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507

Differential Revision: D5345702

Pulled By: al13n321

fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
2017-06-28 21:27:01 -07:00