Commit Graph

3278 Commits

Author SHA1 Message Date
maoyouxiang
a35451eaa4 fix deadlock with enable_pipelined_write=true and max_successive_merges > 0
Summary:
fix this https://github.com/facebook/rocksdb/issues/3916
Closes https://github.com/facebook/rocksdb/pull/3923

Differential Revision: D8215192

Pulled By: yiwu-arbug

fbshipit-source-id: a4c2f839a91d92dc70906d2b7c6de0fe014a2422
2018-05-31 11:13:14 -07:00
Siying Dong
4dd80debd0 Remove tests from ROCKSDB_VALGRIND_RUN
Summary:
In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924

Differential Revision: D8210184

Pulled By: siying

fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
2018-05-30 16:15:16 -07:00
Anand Ananthabhotla
a736255de8 Delete triggered compaction for universal style
Summary:
This is still WIP, but I'm hoping for early feedback on the overall approach.

This patch implements deletion triggered compaction, which till now only
worked for leveled, for universal style. SST files are marked for
compaction by the CompactOnDeletionCollertor table property. This is
expected to be used when free disk space is low and the user wants to
reclaim space by deleting a bunch of keys. The deletions are expected to
be dense. In such a situation, we want to avoid a full compaction due to
its space overhead.

The strategy used in this case is similar to leveled. We pick one file
from the set of files marked for compaction. We then expand the inputs
to a clean cut on the same level, and then pick overlapping files from
the next non-mepty level. Picking files from the next level can cause
the key range to expand, and we opportunistically expand inputs in the
source level to include files wholly in this key range.

The main side effect of this is that it breaks the property of no time
range overlap between levels. This shouldn't break any functionality.
Closes https://github.com/facebook/rocksdb/pull/3860

Differential Revision: D8124397

Pulled By: anand1976

fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
2018-05-29 15:44:34 -07:00
Yanqin Jin
cf826de3ed Fix compilation error when OPT="-DROCKSDB_LITE".
Summary: Closes https://github.com/facebook/rocksdb/pull/3917

Differential Revision: D8187733

Pulled By: riversand963

fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910
2018-05-29 12:28:59 -07:00
奏之章
1c1bafa668 Fix VersionStorageInfo::EstimateLiveDataSize seg fault
Summary:
`HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true
402b7aa07f/db/internal_stats.cc (L412-L413)
so , is will ref a `SuperVersion`
402b7aa07f/db/db_impl.cc (L1896-L1908)
so , the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe , but `cfd_->current()` is not safe !
402b7aa07f/db/internal_stats.cc (L790-L795)

the `cfd_->current()` maybe invalid ...

here's mongo-rocks crash backtrace
```
 mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51]
 mongod(+0x2152E89) [0x7fe3a3136e89]
 mongod(+0x21534F6) [0x7fe3a31374f6]
 libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0]
 mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7]
 mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba]
 mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long*, rocksdb::DBImpl*, rocksdb::Version*)+0x20) [0x7fe3a2250d70]
 mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData*, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf]
```
Closes https://github.com/facebook/rocksdb/pull/3912

Differential Revision: D8179944

Pulled By: yiwu-arbug

fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1
2018-05-28 11:27:08 -07:00
Maysam Yabandeh
402b7aa07f Exclude seq from index keys
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894

Differential Revision: D8118775

Pulled By: maysamyabandeh

fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
2018-05-25 18:42:43 -07:00
Yanqin Jin
aa53579d6c Fix segfault caused by object premature destruction
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.

To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.

Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.

Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```

Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898

Differential Revision: D8149421

Pulled By: riversand963

fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
2018-05-25 11:57:51 -07:00
QingpingWang
070319f7bb add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup
Summary:
Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior.
Closes https://github.com/facebook/rocksdb/pull/3897

Differential Revision: D8157676

Pulled By: ajkr

fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505
2018-05-24 22:28:52 -07:00
Yi Wu
bc7e8d472e LRUCache midpoint insertion
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877

Differential Revision: D8100895

Pulled By: yiwu-arbug

fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
2018-05-24 15:57:33 -07:00
Yanqin Jin
4011012d9d Specify the underlying type of enums.
Summary:
Explicitly specify the underlying type of enums help developers understand the physical storage.
Closes https://github.com/facebook/rocksdb/pull/3892

Differential Revision: D8107027

Pulled By: riversand963

fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3
2018-05-23 16:12:59 -07:00
Andrew Kryczka
7db721b9a6 Avoid sleep in DBTest.GroupCommitTest to fix flakiness
Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883

Differential Revision: D8079429

Pulled By: ajkr

fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049
2018-05-22 12:16:25 -07:00
Siying Dong
3db1ada3bf PersistRocksDBOptions() to use WritableFileWriter
Summary:
By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering.
Closes https://github.com/facebook/rocksdb/pull/3882

Differential Revision: D8080673

Pulled By: siying

fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b
2018-05-21 16:42:22 -07:00
Zhongyi Xie
c3ebc75843 Move prefix_extractor to MutableCFOptions
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
2018-05-21 14:43:11 -07:00
Yanqin Jin
263ef52b65 Update ColumnFamilyTest for multi-CF verification
Summary:
Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
family's keys are stored in one set.

ajkr When you have a chance, can you PTAL? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3871

Differential Revision: D8056447

Pulled By: riversand963

fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386
2018-05-21 11:57:42 -07:00
Andrew Kryczka
7b655214d2 Assert keys/values pinned by range deletion meta-block iterators
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875

Differential Revision: D8063667

Pulled By: ajkr

fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
2018-05-21 09:57:00 -07:00
Zhongyi Xie
ed4d3393fb fix a division by zero bug
Summary:
fixes the failing clang_analyze contrun test
Closes https://github.com/facebook/rocksdb/pull/3872

Differential Revision: D8059241

Pulled By: miasantreble

fbshipit-source-id: e8fc1838004fe16a823456188386b8b39429803b
2018-05-18 21:57:24 -07:00
Siying Dong
17af09fcce Implement key shortening functions in ReverseBytewiseComparator
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.

Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836

Differential Revision: D7959762

Pulled By: siying

fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
2018-05-17 18:27:16 -07:00
Zhongyi Xie
1d7ca20f29 add override to virtual functions
Summary:
this will fix the failing clang_check test
Closes https://github.com/facebook/rocksdb/pull/3868

Differential Revision: D8050880

Pulled By: miasantreble

fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6
2018-05-17 17:57:48 -07:00
Mike Kolupaev
8bf555f487 Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
2018-05-17 02:56:56 -07:00
Maysam Yabandeh
46fde6b653 Fix race condition between log_.erase and log_.back
Summary:
log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it.
Fixes https://github.com/facebook/rocksdb/issues/3852
Closes https://github.com/facebook/rocksdb/pull/3859

Differential Revision: D8026103

Pulled By: maysamyabandeh

fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28
2018-05-16 13:01:33 -07:00
Maysam Yabandeh
12ad711247 Suppress tsan lock-order-inversion on FlushWAL
Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.

The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854

Differential Revision: D8000202

Pulled By: maysamyabandeh

fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
2018-05-14 21:13:35 -07:00
Andrew Kryczka
3d7dc75b36 Bottommost level-based compactions in bottom-pri pool
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835

Differential Revision: D7957179

Pulled By: ajkr

fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
2018-05-14 14:57:15 -07:00
Maysam Yabandeh
718c1c9c1f Pass manual_wal_flush also to the first wal file
Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824

Differential Revision: D7909153

Pulled By: maysamyabandeh

fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
2018-05-14 10:57:56 -07:00
Sergey Elin
3272bc07c6 Fix formatting in log message
Summary:
Add missing space.
Closes https://github.com/facebook/rocksdb/pull/3826

Differential Revision: D7956059

Pulled By: miasantreble

fbshipit-source-id: 3aeba76385f8726399a3086c46de710636a31191
2018-05-11 11:28:54 -07:00
Andrew Kryczka
072ae671a7 Apply use_direct_io_for_flush_and_compaction to writes only
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
2018-05-09 19:42:58 -07:00
Dmitri Smirnov
f92cd2feb4 Introduce and use the option to disable stall notifications structures
Summary:
and code. Removing this helps with insert performance.
Closes https://github.com/facebook/rocksdb/pull/3830

Differential Revision: D7921030

Pulled By: siying

fbshipit-source-id: 84e80d50a7ef96f5441c51c9a0d089c50217cce2
2018-05-09 10:13:53 -07:00
Andrew Kryczka
4bf169f07e Disable readahead when using mmap for reads
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the  `mmap`'d memory region. This PR:

- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813

Differential Revision: D7891513

Pulled By: ajkr

fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
2018-05-08 12:13:18 -07:00
Maysam Yabandeh
d72a51e9e1 Split FaultInjectionTest.FaultTest to avoid timeout
Summary:
tsan flavor of this test occasionally times out in our test infra. The patch split the test to two, each working on half of the option range.
Before:
[       OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms)
After:
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms)
Closes https://github.com/facebook/rocksdb/pull/3819

Differential Revision: D7894975

Pulled By: maysamyabandeh

fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9
2018-05-07 12:29:58 -07:00
LingBin
72942ad7a4 Recommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple times"
Summary:
The origin commit #3635  will hurt performance for users who aren't using range deletions, because unneeded std::set operations, so it was reverted by commit 44653c7b7a. (see #3672)

To fix this, move the set to  and add a check in , i.e., file will be added only if  is non-nullptr.

The db_bench command which find the performance regression:
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 > --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 > --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 > -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

Before and after the modification, I re-run this command on the machine, the results of are as follows:

  **fillrandom**
 Table | P50 | P75 | P99 | P99.9 | P99.99 |
  ---- | --- | --- | --- | ----- | ------ |
 before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
 after commit  | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |

 **seekrandomwhilewriting**
  Table | P50 | P75 | P99 | P99.9 | P99.99 |
   ---- | --- | --- | --- | ----- | ------ |
 before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
 after commit  | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
Closes https://github.com/facebook/rocksdb/pull/3800

Differential Revision: D7874245

Pulled By: ajkr

fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
2018-05-04 16:45:15 -07:00
Maysam Yabandeh
171f415b30 Rename vars to satisfy unity built
Summary:
Tested by "make unity_test"
Closes https://github.com/facebook/rocksdb/pull/3807

Differential Revision: D7882657

Pulled By: maysamyabandeh

fbshipit-source-id: 84862c18d7f2fc762bd96ad070eaeb6936e45159
2018-05-04 15:28:06 -07:00
Zhongyi Xie
a703432808 MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction
Summary:
`MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used.
This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`)
Fixes https://github.com/facebook/rocksdb/issues/3229

In master:

```
Level Files Size(MB)
--------------------
  0       14      846
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5       15      366
  6       11      481
Cumulative compaction: 3.83 GB write, 2.27 GB read
```
In branch:
```
Level Files Size(MB)
--------------------
  0        9      544
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6      445      935
Cumulative compaction: 2.91 GB write, 1.46 GB read
```

db_bench command used:
```
./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2
```
Closes https://github.com/facebook/rocksdb/pull/3755

Differential Revision: D7721381

Pulled By: miasantreble

fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8
2018-05-03 16:42:13 -07:00
Dmitri Smirnov
934f96de27 Better destroydb
Summary:
Delete archive directory before WAL folder
  since archive may be contained as a subfolder.
  Also improve loop readability.
Closes https://github.com/facebook/rocksdb/pull/3797

Differential Revision: D7866378

Pulled By: riversand963

fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5
2018-05-03 16:13:09 -07:00
Maysam Yabandeh
a8d77ca381 Speedup ManualCompactionTest.Test
Summary:
ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
Closes https://github.com/facebook/rocksdb/pull/3802

Differential Revision: D7865596

Pulled By: maysamyabandeh

fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785
2018-05-03 16:13:09 -07:00
Siying Dong
d59549298f Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Zhongyi Xie
6cab3184f5 avoid double delete on dummy record insertion failure
Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688

Differential Revision: D7537741

Pulled By: miasantreble

fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6
2018-05-01 16:01:28 -07:00
Victor Grishchenko
c9ace1d81b expose WAL iterator in the C API
Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304

Differential Revision: D6628736

Pulled By: miasantreble

fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a
2018-04-27 16:56:59 -07:00
Huachao Huang
ed7a95b28c Add max_subcompactions as a compaction option
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775

Differential Revision: D7792357

Pulled By: ajkr

fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
2018-04-27 11:57:39 -07:00
Yanqin Jin
7dfbe33532 Rename pending_compaction_ to queued_for_compaction_.
Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781

Differential Revision: D7790063

Pulled By: riversand963

fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
2018-04-27 11:12:01 -07:00
Yanqin Jin
513b5ce618 Rename pending_flush_ to queued_for_flush_.
Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush

```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```

`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777

Differential Revision: D7783066

Pulled By: riversand963

fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
2018-04-26 21:12:51 -07:00
Siying Dong
63c965cdb4 Sync parent directory after deleting a file in delete scheduler
Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767

Differential Revision: D7760136

Pulled By: siying

fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
2018-04-26 13:58:20 -07:00
Vincent Lee
7c9f23e6db Rate limiter should be allowed to share between different rocksdb instances in C API
Summary:
Currently, the `rocksdb_options_set_ratelimiter` in  `c.cc` will change the input to nil, which make it is
 not possible to use the shared rate limiter create by `rocksdb_ratelimiter_create` in different rocksdb option.

In this pr, I changed it to shared ptr.
Closes https://github.com/facebook/rocksdb/pull/3758

Differential Revision: D7749740

Pulled By: ajkr

fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5
2018-04-25 15:57:48 -07:00
Mike Kolupaev
affe01b0d5 Improve write time breakdown stats
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
2018-04-23 17:58:54 -07:00
Siying Dong
d5afa73789 Revert "Skip deleted WALs during recovery"
Summary:
This reverts commit 73f21a7b21.

It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
2018-04-23 12:01:26 -07:00
Anand Ananthabhotla
dbdaa4662e Add a stat for MultiGet keys found, update memtable hit/miss stats
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730

Differential Revision: D7677364

Pulled By: anand1976

fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
2018-04-20 15:28:19 -07:00
Maysam Yabandeh
c3d1e36cce WritePrepared Txn: enable TryAgain for duplicates at the end of the batch
Summary:
The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might  not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747

Differential Revision: D7708391

Pulled By: maysamyabandeh

fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
2018-04-20 15:28:19 -07:00
przemyslaw.skibinski@percona.com
dee95a1afc Fix GitHub issue #3716: gcc-8 warnings
Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]

Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736

Differential Revision: D7684161

Pulled By: yiwu-arbug

fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
2018-04-20 13:42:47 -07:00
Zhongyi Xie
e1e826b980 check return status for Sync() and Append() calls to avoid corruption
Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status.
Closes https://github.com/facebook/rocksdb/pull/3740

Differential Revision: D7678848

Pulled By: miasantreble

fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77
2018-04-19 14:13:46 -07:00
Yi Wu
ad511684b2 Add block cache related DB properties
Summary:
Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
Closes https://github.com/facebook/rocksdb/pull/3734

Differential Revision: D7657180

Pulled By: yiwu-arbug

fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd
2018-04-18 21:42:25 -07:00
Yanqin Jin
5e48811844 Initialize a boolean member variable of a struct.
Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).

UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN      ] ExternalSSTFileBasicTest.Basic
[       OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN      ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'

miasantreble  I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```

ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728

Differential Revision: D7642476

Pulled By: riversand963

fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e
2018-04-16 14:28:01 -07:00
Zhongyi Xie
954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
zhangjinpeng1987
31ee4bf240 add kEntryRangeDeletion
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695

Differential Revision: D7570322

Pulled By: ajkr

fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
2018-04-13 11:27:17 -07:00
Yanqin Jin
c81b0abedd Improve accuracy of I/O stats collection of external SST ingestion.
Summary:
RocksDB supports ingestion of external ssts. If ingestion_options.move_files is true, when performing ingestion, RocksDB first tries to link external ssts. If external SST file resides on a different FS, or the underlying FS does not support hard link, then RocksDB performs actual file copy. However, no matter which choice is made, current code increase bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy file.

Rename a sync point.
Closes https://github.com/facebook/rocksdb/pull/3713

Differential Revision: D7604151

Pulled By: riversand963

fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
2018-04-13 10:58:42 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Yanqin Jin
d42bd041c5 Improve visibility into the reasons for compaction.
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679

Differential Revision: D7550348

Pulled By: riversand963

fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
2018-04-11 10:58:44 -07:00
Andrew Kryczka
019d7894eb fix calling SetOptions on deprecated options
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:

```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```

We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700

Differential Revision: D7572596

Pulled By: ajkr

fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
2018-04-10 19:02:09 -07:00
Yanqin Jin
d95014b9df fix some text in comments.
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
   our best to conform to conventions. Specifically, use 'callback' instead of
   'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693

Differential Revision: D7560396

Pulled By: riversand963

fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
2018-04-10 15:59:24 -07:00
Zhongyi Xie
2770a94c42 make MockTimeEnv::current_time_ atomic to fix data race
Summary:
fix a new TSAN failure
https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
Closes https://github.com/facebook/rocksdb/pull/3694

Differential Revision: D7565310

Pulled By: miasantreble

fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
2018-04-10 14:13:18 -07:00
Gihwan Oh
65fe8d6cd6 Change a comment
Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686

Differential Revision: D7556781

Pulled By: ajkr

fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
2018-04-09 13:42:31 -07:00
Andrew Kryczka
1c27cbfbd1 fix intra-L0 FIFO for uncompressed use case
Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
Closes https://github.com/facebook/rocksdb/pull/3684

Differential Revision: D7530396

Pulled By: ajkr

fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
2018-04-09 13:42:31 -07:00
Zhongyi Xie
f3a1d9e049 fix data race
Summary:
Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
Closes https://github.com/facebook/rocksdb/pull/3691

Differential Revision: D7541400

Pulled By: miasantreble

fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb
2018-04-09 12:28:28 -07:00
Maysam Yabandeh
bde1c1a72a WritePrepared Txn: add stats
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
2018-04-07 21:56:42 -07:00
Gihwan Oh
74767deec3 Fix typo
Summary:
regrad -> regard
Closes https://github.com/facebook/rocksdb/pull/3685

Differential Revision: D7540952

Pulled By: miasantreble

fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad
2018-04-06 15:42:50 -07:00
Phani Shekhar Mantripragada
446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Dmitri Smirnov
147dfc7bdf Fix pre_release callback argument list.
Summary:
Primitive types constness does not affect the signature of the
  method and has no influence on whether the overriding method would
  actually have that const bool instead of just bool. In addition,
  it is rarely useful but does produce a compatibility warnings
  in VS 2015 compiler.
Closes https://github.com/facebook/rocksdb/pull/3663

Differential Revision: D7475739

Pulled By: ajkr

fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
2018-04-05 11:12:16 -07:00
Zhongyi Xie
c827b2dc2a fix build for rocksdb lite
Summary:
currently rocksdb lite build fails due to the following errors:
> db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                                                   ^
db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                ^
db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
 class FlushedFileCollector : public EventListener {
       ^
db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
     flushed_files_.push_back(info.file_path);
                                   ^
cc1plus: all warnings being treated as errors
make: *** [db/db_sst_test.o] Error 1
Closes https://github.com/facebook/rocksdb/pull/3676

Differential Revision: D7493006

Pulled By: miasantreble

fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
2018-04-05 09:11:36 -07:00
Sagar Vemuri
7d9067991e Ttl-triggered and snapshot-release-triggered compactions should not be manual compactions
Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678

Differential Revision: D7498151

Pulled By: sagar0

fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d
2018-04-05 06:41:52 -07:00
Sagar Vemuri
04c11b867d Level Compaction with TTL
Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.

This functionality can be controlled by the newly introduced column family option -- ttl.

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
2018-04-02 22:14:28 -07:00
Maysam Yabandeh
b225de7e10 WritePrepared Txn: smallest_prepare optimization
Summary:
The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649

Differential Revision: D7388630

Pulled By: maysamyabandeh

fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
2018-04-02 20:27:41 -07:00
Amy Tai
1579626d0d Enable cancelling manual compactions if they hit the sfm size limit
Summary:
Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
Closes https://github.com/facebook/rocksdb/pull/3670

Differential Revision: D7457683

Pulled By: amytai

fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
2018-04-02 19:58:04 -07:00
Zhongyi Xie
44653c7b7a Revert "Avoid adding tombstones of the same file to RangeDelAggregato…
Summary:
…r multiple times"

This reverts commit e80709a33a.

lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads
I'm reverting the commit for now but feel free to submit new patches 😃

To reproduce the regression, you can run the following db_bench command
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

write stats printed by db_bench:

Table | | | | | | | | | | |
 --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
revert commit | Percentiles: | P50: | 80.77  | P75: |102.94  |P99: | 1786.44 | P99.9: | 1892.39 |P99.99: 2645.10 |
keep commit | Percentiles: | P50: | 221.72 | P75: | 686.62 | P99: | 1842.57 | P99.9: | 1899.70|  P99.99: 2814.29|
Closes https://github.com/facebook/rocksdb/pull/3672

Differential Revision: D7463315

Pulled By: miasantreble

fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d
2018-04-02 19:58:04 -07:00
Fosco Marotto
d12112d05e Throw NoSpace instead of IOError when out of space.
Summary:
Replaces #1702 and is updated from feedback.
Closes https://github.com/facebook/rocksdb/pull/3531

Differential Revision: D7457395

Pulled By: gfosco

fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82
2018-03-30 15:27:18 -07:00
Maysam Yabandeh
73f21a7b21 Skip deleted WALs during recovery
Summary:
This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488

Differential Revision: D6967893

Pulled By: maysamyabandeh

fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
2018-03-30 11:28:05 -07:00
Maysam Yabandeh
89d989ed75 WritePrepared Txn: fix a bug in publishing recoverable state seq
Summary:
When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
Closes https://github.com/facebook/rocksdb/pull/3665

Differential Revision: D7446099

Pulled By: maysamyabandeh

fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b
2018-03-29 14:46:41 -07:00
Maysam Yabandeh
0377ff9dea WritePrepared Txn: make recoverable state visible after flush
Summary:
Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state.
Closes https://github.com/facebook/rocksdb/pull/3661

Differential Revision: D7424577

Pulled By: maysamyabandeh

fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
2018-03-28 12:12:08 -07:00
Yanqin Jin
1f5def1653 Fix race condition causing double deletion of ssts
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files,  but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.

The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline              user_thread                background_compaction thread
t1   |                                          FindObsoleteFiles(full_scan=false)
t2   |     FindObsoleteFiles(full_scan=true)
t3   |                                          PurgeObsoleteFiles
t4   |     PurgeObsoleteFiles
     V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.

ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638

Differential Revision: D7384372

Pulled By: riversand963

fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
2018-03-28 10:29:59 -07:00
Maysam Yabandeh
35a4469bbf Fix race condition via concurrent FlushWAL
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.

Fixes #3599
Closes https://github.com/facebook/rocksdb/pull/3656

Differential Revision: D7405608

Pulled By: maysamyabandeh

fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
2018-03-26 16:29:56 -07:00
Maysam Yabandeh
3e417a6607 WritePrepared Txn: AddPrepared for all sub-batches
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651

Differential Revision: D7388635

Pulled By: maysamyabandeh

fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
2018-03-23 17:30:04 -07:00
Dmitri Smirnov
d382ae7de6 Imporve perf of random read and insert compare by suggesting inlining to the compiler
Summary:
Results from 2015 compiler. This improve sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining.

Before:
fillseq      :       **3.638 micros/op 274866 ops/sec;  213.9 MB/s**

After:
fillseq      :       **3.379 micros/op 295979 ops/sec;  230.3 MB/s**
Closes https://github.com/facebook/rocksdb/pull/3645

Differential Revision: D7382711

Pulled By: siying

fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4
2018-03-23 13:26:55 -07:00
LingBin
e80709a33a Avoid adding tombstones of the same file to RangeDelAggregator multiple times
Summary:
RangeDelAggregator will remember the files whose range tombstones have been added,
so the caller can check whether the file has been added before call AddTombstones.

Closes https://github.com/facebook/rocksdb/pull/3635

Differential Revision: D7354604

Pulled By: ajkr

fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107
2018-03-23 12:43:06 -07:00
Radoslaw Zarzynski
09b6bf828a InlineSkiplist: don't decode keys unnecessarily during comparisons
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.

Rough performance comparison
=====
Big keys, no compression.

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.222 micros/op 236836 ops/sec;   80.4 MB/s
```

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.064 micros/op 246059 ops/sec;   83.5 MB/s
```

TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516

Differential Revision: D7059300

Pulled By: ajkr

fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
2018-03-23 12:14:30 -07:00
Zhongyi Xie
1cbc96d236 FlushReason improvement
Summary:
Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full"

> $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
$ grep 'flush_reason' /dev/shm/dbbench/LOG
...
2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}

With the fix:
> 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
Closes https://github.com/facebook/rocksdb/pull/3627

Differential Revision: D7328772

Pulled By: miasantreble

fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
2018-03-22 18:42:18 -07:00
Sagar Vemuri
2e3d407778 Fsync after writing global seq number in ExternalSstFileIngestionJob
Summary:
Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect.
Closes https://github.com/facebook/rocksdb/pull/3644

Differential Revision: D7373813

Pulled By: sagar0

fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8
2018-03-22 17:42:56 -07:00
Andrew Kryczka
4d51feab0b Rename function for handling WAL write error
Summary:
It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback.
Closes https://github.com/facebook/rocksdb/pull/3485

Differential Revision: D6955787

Pulled By: ajkr

fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
2018-03-22 15:58:39 -07:00
Maysam Yabandeh
7429b20e39 WritePrepared Txn: fix race condition on publishing seq
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641

Differential Revision: D7361508

Pulled By: maysamyabandeh

fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
2018-03-22 14:43:36 -07:00
QingpingWang
70282cf876 fix behavior does not match name for "IsFileDeletionsEnabled"
Summary:
for PR https://github.com/facebook/rocksdb/pull/3598
I deleted the original repo for some reason. Sorry for the inconvenience.
Closes https://github.com/facebook/rocksdb/pull/3612

Differential Revision: D7291671

Pulled By: ajkr

fbshipit-source-id: 918490ba86b13fe450d232af436cbe259d847c64
2018-03-21 22:13:34 -07:00
QingpingWang
2ce8f63f81 C API for PerfContext
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607

Differential Revision: D7294225

Pulled By: ajkr

fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
2018-03-21 22:13:34 -07:00
Jingguo Yao
8823487ff7 doc: fix a typo
Summary:
s/synchromization/synchronization/
Closes https://github.com/facebook/rocksdb/pull/3583

Differential Revision: D7276596

Pulled By: sagar0

fbshipit-source-id: 552ec6d6935f642e1a3a7c552de6c94441ac50e0
2018-03-21 15:58:58 -07:00
Siying Dong
93d52696bf Memory Problem Of Destorying ColumnFamilyHandle after deleting the CF
Summary:
When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610

Differential Revision: D7281025

Pulled By: siying

fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
2018-03-20 17:13:12 -07:00
Andrew Kryczka
d1b26507bd fix db_compaction_test when compression disabled
Summary:
Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in no-compression use case, andone file in compression use case. I increased `target_file_size_base` so it generates one file in both use cases.
Closes https://github.com/facebook/rocksdb/pull/3625

Differential Revision: D7311885

Pulled By: ajkr

fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107
2018-03-19 12:30:05 -07:00
zhsj
cc340268e9 fix wrong length in snprintf
Summary: Closes https://github.com/facebook/rocksdb/pull/3622

Differential Revision: D7307689

Pulled By: ajkr

fbshipit-source-id: b8f52effc63fea06c2058b39c60944c2c1f814b4
2018-03-16 13:27:55 -07:00
Huachao Huang
ecfca1ff59 Optimize overlap checking for external file ingestion
Summary:
If there are a lot of overlapped files in L0, creating a merging iterator for
all files in L0 to check overlap can be very slow because we need to read and
seek all files in L0. However, in that case, the ingested file is likely to
overlap with some files in L0, so if we check those files one by one, we can stop
once we encounter overlap.

Ref: https://github.com/facebook/rocksdb/issues/3540
Closes https://github.com/facebook/rocksdb/pull/3564

Differential Revision: D7196784

Pulled By: anand1976

fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67
2018-03-16 10:43:17 -07:00
Niv Dayan
da82aab126 allowing CompactFiles to return new file names
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608

Differential Revision: D7275789

Pulled By: siying

fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
2018-03-15 11:58:12 -07:00
Dmitri Smirnov
6f7b7f91b5 Optionally create DuplicateDetector
Summary:
Address issue https://github.com/facebook/rocksdb/issues/3579
Closes https://github.com/facebook/rocksdb/pull/3589

Differential Revision: D7221161

Pulled By: yiwu-arbug

fbshipit-source-id: bd875ab0aa0e414dfa98b1bf036ba9b4ed351361
2018-03-14 00:57:25 -07:00
Andrew Kryczka
2256dab135 fix flaky DBSSTTest.DeleteSchedulerMultipleDBPaths
Summary:
I landed #3544 which made this test flaky. The reason was the files scheduled for deletion sometimes went through the trash-marking process, and sometimes were deleted directly. Our counter only bumped on the former code path, so if the latter code path was used, we'd miss counting a file deleted by deletion scheduler. This PR also bumps the counter in the latter code path.
Closes https://github.com/facebook/rocksdb/pull/3593

Differential Revision: D7226173

Pulled By: yiwu-arbug

fbshipit-source-id: 81ab44c60834df6ff88db1d73ea34e26c6e93c39
2018-03-13 14:57:26 -07:00
Amy Tai
e476d0e252 Adding stat to count cancelled compactions
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574

Differential Revision: D7190259

Pulled By: amytai

fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
2018-03-08 10:42:28 -08:00
Bruce Mitchener
a3a3f5497c Fix some typos in comments and docs.
Summary: Closes https://github.com/facebook/rocksdb/pull/3568

Differential Revision: D7170953

Pulled By: siying

fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c
2018-03-08 10:27:25 -08:00
Lukas Rist
a277b0f2b7 Clarification regarding record format
Summary:
The CRC is actually calculated based on the record type and payload.
The wiki should also be updated accordingly and extended with a section on the recyclable record format.
Closes https://github.com/facebook/rocksdb/pull/3576

Differential Revision: D7196478

Pulled By: siying

fbshipit-source-id: 39f7a0395075cc73e2aa2bfc9e42c85bce35e765
2018-03-08 10:27:25 -08:00
Bruce Mitchener
0de710f5b8 Use nullptr instead of NULL / 0 more consistently.
Summary: Closes https://github.com/facebook/rocksdb/pull/3569

Differential Revision: D7170968

Pulled By: yiwu-arbug

fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348
2018-03-07 12:42:12 -08:00
Stuart
f021f1d9e1 Add rocksdb_open_with_ttl function in C API
Summary:
Change-Id: Ie6f9b10bce459f6bf0ade0e5877264b4e10da3f5
Signed-off-by: Stuart <Stuart.Hu@emc.com>
Closes https://github.com/facebook/rocksdb/pull/3553

Differential Revision: D7144833

Pulled By: sagar0

fbshipit-source-id: 815225fa6e560d8a5bc47ffd0a98118b107ce264
2018-03-06 20:57:20 -08:00
amytai
0a3db28d98 Disallow compactions if there isn't enough free space
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.

Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449

Differential Revision: D7016941

Pulled By: amytai

fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
2018-03-06 16:27:54 -08:00
Andrew Kryczka
20c508c1ed Enable subcompactions in manual level-based compaction
Summary:
This is the simplest way I could think of to speed up `CompactRange`. It works but isn't that optimal because it relies on the same `max_compaction_bytes` and `max_subcompactions` options that are used in other places. If it turns out to be useful we can allow overriding these in `CompactRangeOptions` in the future.
Closes https://github.com/facebook/rocksdb/pull/3549

Differential Revision: D7117634

Pulled By: ajkr

fbshipit-source-id: d0cd03d6bd0d2fd7ea3fb13cd3b8bf7c47d11e42
2018-03-06 12:43:51 -08:00
Andrew Kryczka
6a3eebbab0 support multiple db_paths in SstFileManager
Summary:
Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion.
Closes https://github.com/facebook/rocksdb/pull/3544

Differential Revision: D7093786

Pulled By: ajkr

fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f
2018-03-06 12:43:51 -08:00
Fosco Marotto
d518fe1da6 uint64_t and size_t changes to compile for iOS
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t.  This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.

This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.

Also up for discussion: is iOS worth supporting?  Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503

Differential Revision: D7106457

Pulled By: gfosco

fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
2018-03-06 12:43:51 -08:00
Dmitri Smirnov
c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and aling unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoin tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Maysam Yabandeh
62277e15c3 WritePrepared Txn: Move DuplicateDetector to util
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567

Differential Revision: D7163268

Pulled By: maysamyabandeh

fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
2018-03-05 23:57:12 -08:00
Huachao Huang
9cb4856dbd Don't need to UpdateFilesByCompactionPri for kCompactionStyleNone
Summary: Closes https://github.com/facebook/rocksdb/pull/3563

Differential Revision: D7154653

Pulled By: ajkr

fbshipit-source-id: 4f32fb1b02451a934504c40be22b07fb1f2deb9c
2018-03-05 17:57:39 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Sagar Vemuri
15f55e5e06 Fix TSAN timeout in MergeOperatorPinningTest.Randomized/x test
Summary:
[FB - Internal]
MergeOperatorPinningTest.Randomized/x tests are frequently failing with timeouts when run with tsan, as they are exceeding 10 minute limit for tests. The tests are in turn getting disabled due to frequent failures.
I halved the number of rounds to make the test complete sooner. This reduces the number of testing iterations a little, but it still is much better than totally letting the test be disabled.
Closes https://github.com/facebook/rocksdb/pull/3523

Differential Revision: D7031498

Pulled By: sagar0

fbshipit-source-id: 9a694f2176b235259920a42bf24bca5346f7cff1
2018-03-02 16:27:21 -08:00
Yi Wu
1209b6db5c Blob DB: remove existing garbage collection implementation
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.

CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551

Differential Revision: D7130190

Pulled By: yiwu-arbug

fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
2018-03-02 12:57:23 -08:00
Maysam Yabandeh
d060421c77 Fix a leak in prepared_section_completed_
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545

Differential Revision: D7106071

Pulled By: maysamyabandeh

fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
2018-03-01 20:41:56 -08:00
Yi Wu
bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
leviathan1995
ec5843dca9 Comment typo
Summary: Closes https://github.com/facebook/rocksdb/pull/3546

Differential Revision: D7111708

Pulled By: ajkr

fbshipit-source-id: 522a4a00eb3e34c73afcb86c1f75cd2e90e7608d
2018-02-28 09:56:45 -08:00
Andrew Kryczka
3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Zhongyi Xie
ad05cbb182 DB:Open should fail on tmpfs when use_direct_reads=true
Summary:
Before:

> $ TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
DB path: [/dev/shm/dbbench]
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
db_bench: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument

After:
> TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
open error: Not implemented: Direct I/O is not supported by the specified DB.
Closes https://github.com/facebook/rocksdb/pull/3539

Differential Revision: D7082658

Pulled By: miasantreble

fbshipit-source-id: f9d9c6ec3b5e9e049cab52154940ee101ba4d342
2018-02-26 14:58:06 -08:00
Anand Ananthabhotla
dfbe52e099 Fix the Logger::Close() and DBImpl::Close() design pattern
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528

Reviewed By: gfosco

Differential Revision: D7049303

Pulled By: anand1976

fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
2018-02-23 13:57:26 -08:00
Siying Dong
30649dc6a1 Have a different function when ROCKSDB_JEMALLOC=0
Summary:
Some sanitizer is not happy with parameter name with ROCKSDB_JEMALLOC not set. Use another function instead.
Closes https://github.com/facebook/rocksdb/pull/3536

Differential Revision: D7064849

Pulled By: siying

fbshipit-source-id: c6ae94e044686176af1259df9172453d52c2f9d5
2018-02-23 11:42:33 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Sagar Vemuri
8ada876dfe Add rocksdb.iterator.internal-key property
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
2018-02-20 19:12:09 -08:00
Maysam Yabandeh
c178da053b WritePrepared Txn: optimizations for sysbench update_noindex
Summary:
These are optimization that we applied to improve sysbech's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic so the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512

Differential Revision: D7000075

Pulled By: maysamyabandeh

fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
2018-02-16 08:42:31 -08:00
Mike Kolupaev
97307d888f Fix deadlock in ColumnFamilyData::InstallSuperVersion()
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.

This deadlock is hit all the time on our workload. It blocks our release.

In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.

So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.

This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.

I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510

Reviewed By: sagar0

Differential Revision: D7005346

Pulled By: al13n321

fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
2018-02-16 08:13:34 -08:00
Maysam Yabandeh
8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
jsteemann
4e7a182d09 Several small "fixes"
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507

Differential Revision: D7004679

Pulled By: sagar0

fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
2018-02-15 16:57:37 -08:00
Zhongyi Xie
c88c57cde1 Tweak external file ingestion seqno logic under universal compaction
Summary:
Right now it is possible that a file gets assigned to L0 but also assigned the seqno from a higher level which it doesn't fit
Under the current impl, it is possibe that seqno in lower levels (Ln) can be equal to smallest seqno of higher levels (Ln-1), which is undesirable from universal compaction's point of view.
This should fix the intermittent failure of `ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno`
Closes https://github.com/facebook/rocksdb/pull/3411

Differential Revision: D6813802

Pulled By: miasantreble

fbshipit-source-id: 693d0462fa94725ccfb9d8858743e6d2d9992d14
2018-02-15 14:13:39 -08:00
Fosco Marotto
ba6ee1f749 Fix 2 more unused reference errors VS2017
Summary:
As in #3425
Closes https://github.com/facebook/rocksdb/pull/3497

Differential Revision: D6979588

Pulled By: gfosco

fbshipit-source-id: e9fb32d04ad45575dfe9de1d79348d158e474197
2018-02-14 11:12:36 -08:00
Igor Sugak
d08d05cb62 fix UBSAN errors in fault_injection_test
Summary:
This fixes shift and signed-integer-overflow UBSAN checks in fault_injection_test by using a larger and unsigned type.
Closes https://github.com/facebook/rocksdb/pull/3498

Reviewed By: siying

Differential Revision: D6981116

Pulled By: igorsugak

fbshipit-source-id: 3688f62cce570534b161e9b5f42109ebc9ae5a2c
2018-02-13 14:12:40 -08:00
Siying Dong
dadf01672a Rename one of the two LevelIterator
Summary:
A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes in the same name anyway.
Closes https://github.com/facebook/rocksdb/pull/3499

Differential Revision: D6979325

Pulled By: siying

fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb
2018-02-13 13:57:58 -08:00
Siying Dong
b555ed30a4 Customized BlockBasedTableIterator and LevelIterator
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.

The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.

The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406

Differential Revision: D6809041

Pulled By: siying

fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
2018-02-12 17:12:25 -08:00
Andrew Kryczka
ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
Zhongyi Xie
3f1bb07351 make flush_reason_ atomic to keep TSAN happy
Summary: Closes https://github.com/facebook/rocksdb/pull/3487

Differential Revision: D6967098

Pulled By: miasantreble

fbshipit-source-id: 48e0accf2e3b3f589ddb797ff8083c8520269bf0
2018-02-12 13:28:18 -08:00
Siying Dong
ef29d2a234 Explictly fail writes if key or value is not smaller than 4GB
Summary:
Right now, users will encounter unexpected bahavior if they use key or value larger than 4GB. We should explicitly fail the queriers.
Closes https://github.com/facebook/rocksdb/pull/3484

Differential Revision: D6953895

Pulled By: siying

fbshipit-source-id: b60491e1af064fc5d52971956661f6c18ceac24f
2018-02-09 14:57:54 -08:00
Yi Wu
fe228da0a9 WritePrepared Txn: Support merge operator
Summary:
CompactionIterator invoke MergeHelper::MergeUntil() to do partial merge between snapshot boundaries. Previously it only depend on sequence number to tell snapshot boundary, but we also need to make use of snapshot_checker to verify visibility of the merge operands to the snapshots. For example, say there is a snapshot with seq = 2 but only can see data with seq <= 1. There are three merges, each with seq = 1, 2, 3. A correct compaction output would be (1),(2+3). Without taking snapshot_checker into account when generating merge result, compaction will generate output (1+2),(3).

By filtering uncommitted keys with read callback, the read path already take care of merges well and don't need additional updates.
Closes https://github.com/facebook/rocksdb/pull/3475

Differential Revision: D6926087

Pulled By: yiwu-arbug

fbshipit-source-id: 8f539d6f897cfe29b6dc27a8992f68c2a629d40a
2018-02-09 14:57:54 -08:00
Zhongyi Xie
945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it,  logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Siying Dong
821e0b1683 Disable options_settable_test in UBSAN and fix UBSAN failure in blob_…
Summary:
…db_test

options_settable_test won't pass UBSAN so disable it.
blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though.
Closes https://github.com/facebook/rocksdb/pull/3477

Differential Revision: D6928681

Pulled By: siying

fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122
2018-02-07 14:42:26 -08:00
Yi Wu
81736d8afe WritePrepared Txn: update compaction_iterator_test and db_iterator_test
Summary:
Update compaction_iterator_test with write-prepared transaction DB related tests. Transaction related tests are group in CompactionIteratorWithSnapshotCheckerTest. The existing test are duplicated to make them also test with dummy SnapshotChecker that will say every key is visible to every snapshot (this is okay, we still compare sequence number to verify visibility). Merge related tests are disabled and will be revisit in another PR.

Existing db_iterator_tests are also duplicated to test with dummy read_callback that will say every key is committed.
Closes https://github.com/facebook/rocksdb/pull/3466

Differential Revision: D6909253

Pulled By: yiwu-arbug

fbshipit-source-id: 2ae4656b843a55e2e9ff8beecf21f2832f96cd25
2018-02-06 14:12:13 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Anand Ananthabhotla
4b124fb9d3 Handle error return from WriteBuffer()
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404

Differential Revision: D6879954

Pulled By: anand1976

fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
2018-02-05 13:59:34 -08:00
Mike Kolupaev
cb5b8f2090 Fix use-after-free in tailing iterator with merge operator
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415

Differential Revision: D6817414

Pulled By: al13n321

fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
2018-02-02 21:26:28 -08:00
Tamir Duberstein
cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Fosco Marotto
ba8aa8fdc8 Upgrade Appveyor to VS2017
Summary:
Per some discussions, this will switch our Appveyor testing to use Visual Studio 2017.
Closes https://github.com/facebook/rocksdb/pull/3445

Differential Revision: D6874918

Pulled By: gfosco

fbshipit-source-id: c5a0032ca9f37f0d3baeae35c59d850d528c3176
2018-02-01 13:57:01 -08:00
Maysam Yabandeh
813719e952 WritePrepared Txn: Duplicate Keys, Memtable part
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418

Differential Revision: D6822412

Pulled By: maysamyabandeh

fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
2018-01-31 18:57:07 -08:00
Fosco Marotto
5400800a56 Work around VS2017 warning for unused reference
Summary:
For #3407
Closes https://github.com/facebook/rocksdb/pull/3425

Differential Revision: D6836900

Pulled By: gfosco

fbshipit-source-id: 7bcaf7a1beeeeabb7c05584f2745e7b4a2473497
2018-01-31 11:58:10 -08:00
Andrew Kryczka
ab5ab36ac2 fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check
Summary:
Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/".
Closes https://github.com/facebook/rocksdb/pull/3440

Differential Revision: D6848213

Pulled By: ajkr

fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589
2018-01-30 16:50:42 -08:00
Huachao Huang
ab43ff58b5 Delete files in multiple ranges at once
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.

This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.

Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431

Differential Revision: D6849228

Pulled By: ajkr

fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
2018-01-30 13:56:39 -08:00
Yi Wu
4bdf06e78f Fix DBFlushTest::ManualFlushWithMinWriteBufferNumberToMerge dead lock
Summary:
In the test, there can be a dead lock between background flush thread and foreground main thread as following:
* background flush thread:
  - holding db mutex, while
  - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point.
* foreground thread:
  - waiting for db mutex to write "key2"

Fixing by let background flush thread wait without holding db mutex.
Closes https://github.com/facebook/rocksdb/pull/3436

Differential Revision: D6841334

Pulled By: yiwu-arbug

fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
2018-01-29 18:56:47 -08:00
Sagar Vemuri
e6605e5302 Tests for dynamic universal compaction options
Summary:
Added a test for three dynamic universal compaction options, in the realm of read amplification:
- size_ratio
- min_merge_width
- max_merge_width

Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason.
Found a bug in compaction reason setting while working on this PR, and fixed in #3412 .

TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move.
Closes https://github.com/facebook/rocksdb/pull/3419

Differential Revision: D6822217

Pulled By: sagar0

fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
2018-01-29 16:42:45 -08:00
Zhongyi Xie
3fe0937180 Use block cache to track memory usage when ReadOptions.fill_cache=false
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache.

The memory used by the data block is, however, not trackable by users.

To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes https://github.com/facebook/rocksdb/pull/3333

Differential Revision: D6670230

Pulled By: miasantreble

fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
2018-01-29 14:43:10 -08:00
Mark Isaacson
b8eb32f8cf Suppress lint in old files
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.

Reviewed By: yiwu-arbug

Differential Revision: D6821806

fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
2018-01-29 12:56:42 -08:00