Commit Graph

3173 Commits

Author SHA1 Message Date
Maysam Yabandeh
62277e15c3 WritePrepared Txn: Move DuplicateDetector to util
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567

Differential Revision: D7163268

Pulled By: maysamyabandeh

fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
2018-03-05 23:57:12 -08:00
Huachao Huang
9cb4856dbd Don't need to UpdateFilesByCompactionPri for kCompactionStyleNone
Summary: Closes https://github.com/facebook/rocksdb/pull/3563

Differential Revision: D7154653

Pulled By: ajkr

fbshipit-source-id: 4f32fb1b02451a934504c40be22b07fb1f2deb9c
2018-03-05 17:57:39 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Sagar Vemuri
15f55e5e06 Fix TSAN timeout in MergeOperatorPinningTest.Randomized/x test
Summary:
[FB - Internal]
MergeOperatorPinningTest.Randomized/x tests are frequently failing with timeouts when run with tsan, as they are exceeding 10 minute limit for tests. The tests are in turn getting disabled due to frequent failures.
I halved the number of rounds to make the test complete sooner. This reduces the number of testing iterations a little, but it still is much better than totally letting the test be disabled.
Closes https://github.com/facebook/rocksdb/pull/3523

Differential Revision: D7031498

Pulled By: sagar0

fbshipit-source-id: 9a694f2176b235259920a42bf24bca5346f7cff1
2018-03-02 16:27:21 -08:00
Yi Wu
1209b6db5c Blob DB: remove existing garbage collection implementation
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.

CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551

Differential Revision: D7130190

Pulled By: yiwu-arbug

fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
2018-03-02 12:57:23 -08:00
Maysam Yabandeh
d060421c77 Fix a leak in prepared_section_completed_
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545

Differential Revision: D7106071

Pulled By: maysamyabandeh

fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
2018-03-01 20:41:56 -08:00
Yi Wu
bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
leviathan1995
ec5843dca9 Comment typo
Summary: Closes https://github.com/facebook/rocksdb/pull/3546

Differential Revision: D7111708

Pulled By: ajkr

fbshipit-source-id: 522a4a00eb3e34c73afcb86c1f75cd2e90e7608d
2018-02-28 09:56:45 -08:00
Andrew Kryczka
3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Zhongyi Xie
ad05cbb182 DB:Open should fail on tmpfs when use_direct_reads=true
Summary:
Before:

> $ TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
DB path: [/dev/shm/dbbench]
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
db_bench: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument

After:
> TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
open error: Not implemented: Direct I/O is not supported by the specified DB.
Closes https://github.com/facebook/rocksdb/pull/3539

Differential Revision: D7082658

Pulled By: miasantreble

fbshipit-source-id: f9d9c6ec3b5e9e049cab52154940ee101ba4d342
2018-02-26 14:58:06 -08:00
Anand Ananthabhotla
dfbe52e099 Fix the Logger::Close() and DBImpl::Close() design pattern
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528

Reviewed By: gfosco

Differential Revision: D7049303

Pulled By: anand1976

fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
2018-02-23 13:57:26 -08:00
Siying Dong
30649dc6a1 Have a different function when ROCKSDB_JEMALLOC=0
Summary:
Some sanitizer is not happy with parameter name with ROCKSDB_JEMALLOC not set. Use another function instead.
Closes https://github.com/facebook/rocksdb/pull/3536

Differential Revision: D7064849

Pulled By: siying

fbshipit-source-id: c6ae94e044686176af1259df9172453d52c2f9d5
2018-02-23 11:42:33 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Sagar Vemuri
8ada876dfe Add rocksdb.iterator.internal-key property
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
2018-02-20 19:12:09 -08:00
Maysam Yabandeh
c178da053b WritePrepared Txn: optimizations for sysbench update_noindex
Summary:
These are optimization that we applied to improve sysbech's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic so the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512

Differential Revision: D7000075

Pulled By: maysamyabandeh

fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
2018-02-16 08:42:31 -08:00
Mike Kolupaev
97307d888f Fix deadlock in ColumnFamilyData::InstallSuperVersion()
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.

This deadlock is hit all the time on our workload. It blocks our release.

In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.

So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.

This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.

I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510

Reviewed By: sagar0

Differential Revision: D7005346

Pulled By: al13n321

fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
2018-02-16 08:13:34 -08:00
Maysam Yabandeh
8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
jsteemann
4e7a182d09 Several small "fixes"
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507

Differential Revision: D7004679

Pulled By: sagar0

fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
2018-02-15 16:57:37 -08:00
Zhongyi Xie
c88c57cde1 Tweak external file ingestion seqno logic under universal compaction
Summary:
Right now it is possible that a file gets assigned to L0 but also assigned the seqno from a higher level which it doesn't fit
Under the current impl, it is possibe that seqno in lower levels (Ln) can be equal to smallest seqno of higher levels (Ln-1), which is undesirable from universal compaction's point of view.
This should fix the intermittent failure of `ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno`
Closes https://github.com/facebook/rocksdb/pull/3411

Differential Revision: D6813802

Pulled By: miasantreble

fbshipit-source-id: 693d0462fa94725ccfb9d8858743e6d2d9992d14
2018-02-15 14:13:39 -08:00
Fosco Marotto
ba6ee1f749 Fix 2 more unused reference errors VS2017
Summary:
As in #3425
Closes https://github.com/facebook/rocksdb/pull/3497

Differential Revision: D6979588

Pulled By: gfosco

fbshipit-source-id: e9fb32d04ad45575dfe9de1d79348d158e474197
2018-02-14 11:12:36 -08:00
Igor Sugak
d08d05cb62 fix UBSAN errors in fault_injection_test
Summary:
This fixes shift and signed-integer-overflow UBSAN checks in fault_injection_test by using a larger and unsigned type.
Closes https://github.com/facebook/rocksdb/pull/3498

Reviewed By: siying

Differential Revision: D6981116

Pulled By: igorsugak

fbshipit-source-id: 3688f62cce570534b161e9b5f42109ebc9ae5a2c
2018-02-13 14:12:40 -08:00
Siying Dong
dadf01672a Rename one of the two LevelIterator
Summary:
A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes in the same name anyway.
Closes https://github.com/facebook/rocksdb/pull/3499

Differential Revision: D6979325

Pulled By: siying

fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb
2018-02-13 13:57:58 -08:00
Siying Dong
b555ed30a4 Customized BlockBasedTableIterator and LevelIterator
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.

The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.

The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406

Differential Revision: D6809041

Pulled By: siying

fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
2018-02-12 17:12:25 -08:00
Andrew Kryczka
ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
Zhongyi Xie
3f1bb07351 make flush_reason_ atomic to keep TSAN happy
Summary: Closes https://github.com/facebook/rocksdb/pull/3487

Differential Revision: D6967098

Pulled By: miasantreble

fbshipit-source-id: 48e0accf2e3b3f589ddb797ff8083c8520269bf0
2018-02-12 13:28:18 -08:00
Siying Dong
ef29d2a234 Explictly fail writes if key or value is not smaller than 4GB
Summary:
Right now, users will encounter unexpected bahavior if they use key or value larger than 4GB. We should explicitly fail the queriers.
Closes https://github.com/facebook/rocksdb/pull/3484

Differential Revision: D6953895

Pulled By: siying

fbshipit-source-id: b60491e1af064fc5d52971956661f6c18ceac24f
2018-02-09 14:57:54 -08:00
Yi Wu
fe228da0a9 WritePrepared Txn: Support merge operator
Summary:
CompactionIterator invoke MergeHelper::MergeUntil() to do partial merge between snapshot boundaries. Previously it only depend on sequence number to tell snapshot boundary, but we also need to make use of snapshot_checker to verify visibility of the merge operands to the snapshots. For example, say there is a snapshot with seq = 2 but only can see data with seq <= 1. There are three merges, each with seq = 1, 2, 3. A correct compaction output would be (1),(2+3). Without taking snapshot_checker into account when generating merge result, compaction will generate output (1+2),(3).

By filtering uncommitted keys with read callback, the read path already take care of merges well and don't need additional updates.
Closes https://github.com/facebook/rocksdb/pull/3475

Differential Revision: D6926087

Pulled By: yiwu-arbug

fbshipit-source-id: 8f539d6f897cfe29b6dc27a8992f68c2a629d40a
2018-02-09 14:57:54 -08:00
Zhongyi Xie
945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it,  logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Siying Dong
821e0b1683 Disable options_settable_test in UBSAN and fix UBSAN failure in blob_…
Summary:
…db_test

options_settable_test won't pass UBSAN so disable it.
blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though.
Closes https://github.com/facebook/rocksdb/pull/3477

Differential Revision: D6928681

Pulled By: siying

fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122
2018-02-07 14:42:26 -08:00
Yi Wu
81736d8afe WritePrepared Txn: update compaction_iterator_test and db_iterator_test
Summary:
Update compaction_iterator_test with write-prepared transaction DB related tests. Transaction related tests are group in CompactionIteratorWithSnapshotCheckerTest. The existing test are duplicated to make them also test with dummy SnapshotChecker that will say every key is visible to every snapshot (this is okay, we still compare sequence number to verify visibility). Merge related tests are disabled and will be revisit in another PR.

Existing db_iterator_tests are also duplicated to test with dummy read_callback that will say every key is committed.
Closes https://github.com/facebook/rocksdb/pull/3466

Differential Revision: D6909253

Pulled By: yiwu-arbug

fbshipit-source-id: 2ae4656b843a55e2e9ff8beecf21f2832f96cd25
2018-02-06 14:12:13 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Anand Ananthabhotla
4b124fb9d3 Handle error return from WriteBuffer()
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404

Differential Revision: D6879954

Pulled By: anand1976

fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
2018-02-05 13:59:34 -08:00
Mike Kolupaev
cb5b8f2090 Fix use-after-free in tailing iterator with merge operator
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415

Differential Revision: D6817414

Pulled By: al13n321

fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
2018-02-02 21:26:28 -08:00
Tamir Duberstein
cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Fosco Marotto
ba8aa8fdc8 Upgrade Appveyor to VS2017
Summary:
Per some discussions, this will switch our Appveyor testing to use Visual Studio 2017.
Closes https://github.com/facebook/rocksdb/pull/3445

Differential Revision: D6874918

Pulled By: gfosco

fbshipit-source-id: c5a0032ca9f37f0d3baeae35c59d850d528c3176
2018-02-01 13:57:01 -08:00
Maysam Yabandeh
813719e952 WritePrepared Txn: Duplicate Keys, Memtable part
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418

Differential Revision: D6822412

Pulled By: maysamyabandeh

fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
2018-01-31 18:57:07 -08:00
Fosco Marotto
5400800a56 Work around VS2017 warning for unused reference
Summary:
For #3407
Closes https://github.com/facebook/rocksdb/pull/3425

Differential Revision: D6836900

Pulled By: gfosco

fbshipit-source-id: 7bcaf7a1beeeeabb7c05584f2745e7b4a2473497
2018-01-31 11:58:10 -08:00
Andrew Kryczka
ab5ab36ac2 fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check
Summary:
Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/".
Closes https://github.com/facebook/rocksdb/pull/3440

Differential Revision: D6848213

Pulled By: ajkr

fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589
2018-01-30 16:50:42 -08:00
Huachao Huang
ab43ff58b5 Delete files in multiple ranges at once
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.

This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.

Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431

Differential Revision: D6849228

Pulled By: ajkr

fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
2018-01-30 13:56:39 -08:00
Yi Wu
4bdf06e78f Fix DBFlushTest::ManualFlushWithMinWriteBufferNumberToMerge dead lock
Summary:
In the test, there can be a dead lock between background flush thread and foreground main thread as following:
* background flush thread:
  - holding db mutex, while
  - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point.
* foreground thread:
  - waiting for db mutex to write "key2"

Fixing by let background flush thread wait without holding db mutex.
Closes https://github.com/facebook/rocksdb/pull/3436

Differential Revision: D6841334

Pulled By: yiwu-arbug

fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
2018-01-29 18:56:47 -08:00
Sagar Vemuri
e6605e5302 Tests for dynamic universal compaction options
Summary:
Added a test for three dynamic universal compaction options, in the realm of read amplification:
- size_ratio
- min_merge_width
- max_merge_width

Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason.
Found a bug in compaction reason setting while working on this PR, and fixed in #3412 .

TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move.
Closes https://github.com/facebook/rocksdb/pull/3419

Differential Revision: D6822217

Pulled By: sagar0

fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
2018-01-29 16:42:45 -08:00
Zhongyi Xie
3fe0937180 Use block cache to track memory usage when ReadOptions.fill_cache=false
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache.

The memory used by the data block is, however, not trackable by users.

To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes https://github.com/facebook/rocksdb/pull/3333

Differential Revision: D6670230

Pulled By: miasantreble

fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
2018-01-29 14:43:10 -08:00
Mark Isaacson
b8eb32f8cf Suppress lint in old files
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.

Reviewed By: yiwu-arbug

Differential Revision: D6821806

fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
2018-01-29 12:56:42 -08:00
Sagar Vemuri
7fcc1d0ddf Incorrect Universal Compaction reason
Summary:
While writing tests for dynamic Universal Compaction options, I found that the compaction reasons we set for size-ratio based and sorted-run based universal compactions are swapped with each other. Fixed it.
Closes https://github.com/facebook/rocksdb/pull/3412

Differential Revision: D6820540

Pulled By: sagar0

fbshipit-source-id: 270a188968ba25b2c96a8339904416c4c87ff5b3
2018-01-26 11:12:40 -08:00
Yi Wu
c7226428dd WritePrepared Txn: Fix DBIterator and add test
Summary:
In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value.
Closes https://github.com/facebook/rocksdb/pull/3382

Differential Revision: D6756148

Pulled By: yiwu-arbug

fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be
2018-01-23 16:57:11 -08:00
Yi Wu
d46e832e94 Assert last reference before destroy ColumnFamilyData
Summary:
In ColumnFamilySet destructor, assert it hold the last reference to cfd before destroy them.

Closes #3112
Closes https://github.com/facebook/rocksdb/pull/3397

Differential Revision: D6777967

Pulled By: yiwu-arbug

fbshipit-source-id: 60b19070e0c194b3b6146699140c1d68777866cb
2018-01-23 15:12:28 -08:00
Yi Wu
edc258127e DB::DumpSupportInfo should log all supported compression types
Summary:
DB::DumpSupportInfo should log all supported compression types.
Closes #3146
Closes https://github.com/facebook/rocksdb/pull/3396

Differential Revision: D6777019

Pulled By: yiwu-arbug

fbshipit-source-id: 5b17f1ffb2d71224e52f7d9c045434746c789fb0
2018-01-23 14:44:12 -08:00
Nathan VanBenschoten
ec0167eecb Fix WriteBatch rep_ format for RangeDeletion records
Summary:
This is a small amount of general cleanup I made while experimenting with https://github.com/facebook/rocksdb/issues/3391.
Closes https://github.com/facebook/rocksdb/pull/3392

Differential Revision: D6788365

Pulled By: yiwu-arbug

fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
2018-01-23 12:57:32 -08:00
Siying Dong
7291a3f813 Improve fallocate size in compaction output
Summary:
Now in leveled compaction, we allocate solely based on output target file size. If the total input size is smaller than the number, we should use the total input size instead. Also, cap the allocate size to 1GB.
Closes https://github.com/facebook/rocksdb/pull/3385

Differential Revision: D6762363

Pulled By: siying

fbshipit-source-id: e30906f6e9bff3ec847d2166e44cb49c92f98a13
2018-01-22 16:43:46 -08:00
Islam AbdelRahman
c615689bb5 Support skipping bloom filters for SstFileWriter
Summary:
Add an option for SstFileWriter to skip building bloom filters
Closes https://github.com/facebook/rocksdb/pull/3360

Differential Revision: D6709120

Pulled By: IslamAbdelRahman

fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809
2018-01-22 14:42:18 -08:00
Bernard Spil
6f5ba0bf5b Fix building on FreeBSD
Summary:
FreeBSD uses jemalloc as the base malloc implementation.
The patch has been functional on FreeBSD as of the MariaDB 10.2 port.
Closes https://github.com/facebook/rocksdb/pull/3386

Differential Revision: D6765742

Pulled By: yiwu-arbug

fbshipit-source-id: d55dbc082eecf640ef3df9a21f26064ebe6587e8
2018-01-19 17:12:43 -08:00
Yi Wu
5568aec421 Fix DBTest::SoftLimit TSAN failure
Summary:
Fix data race found by TSAN around WriteStallListener: https://gist.github.com/yiwu-arbug/027d2448b903648f2f0f40b05258d80f
Closes https://github.com/facebook/rocksdb/pull/3384

Differential Revision: D6762167

Pulled By: yiwu-arbug

fbshipit-source-id: cd3a5c9f806de390bd1af6077ea6dbbc8bcaec09
2018-01-19 12:57:15 -08:00
Yi Wu
f1cb83fcf4 Fix Flush() keep waiting after flush finish
Summary:
Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User call Flush() with flush_options.wait = true
2. The manual flush started in the background
3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finish.

Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.

Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
Closes https://github.com/facebook/rocksdb/pull/3378

Differential Revision: D6746789

Pulled By: yiwu-arbug

fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
2018-01-18 17:45:16 -08:00
topilski
b9873162f0 Fixed get version on windows, moved throwing exceptions into cc file.
Summary:
Fixes for msys2 and mingw, hide exceptions into cpp  file.
Closes https://github.com/facebook/rocksdb/pull/3377

Differential Revision: D6746707

Pulled By: yiwu-arbug

fbshipit-source-id: 456b38df80bc48b8386a2cf87f669b5a4f9999a4
2018-01-18 14:56:56 -08:00
Andrew Kryczka
46e599fc6b fix live WALs purged while file deletions disabled
Summary:
When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.

We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
Closes https://github.com/facebook/rocksdb/pull/3341

Differential Revision: D6681131

Pulled By: ajkr

fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
2018-01-17 17:42:04 -08:00
Andrew Kryczka
266d85fbec fix DBTest.AutomaticConflictsWithManualCompaction
Summary:
After af92d4ad11, only exclusive manual compaction can have conflict. dc360df81e updated the conflict-checking test case accordingly. But we missed the point that exclusive manual compaction can only conflict with automatic compactions scheduled after it, since it waits on pending automatic compactions before it begins running.

This PR updates the test case to ensure the automatic compactions are scheduled after the manual compaction starts but before it finishes, thus ensuring a conflict. I also cleaned up the test case to use less space as I saw it cause out-of-space error on travis.
Closes https://github.com/facebook/rocksdb/pull/3375

Differential Revision: D6735162

Pulled By: ajkr

fbshipit-source-id: 020530a4e150a4786792dce7cec5d66b420cb884
2018-01-16 23:12:00 -08:00
Yi Wu
dc360df81e Fix multiple build failures
Summary:
* Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure
* Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366
* Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled
* Fix ASAN failure with DBBasicTest::DBClose test
Closes https://github.com/facebook/rocksdb/pull/3373

Differential Revision: D6732313

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5
2018-01-16 17:30:39 -08:00
Sunguck Lee
af92d4ad11 Avoid too frequent MaybeScheduleFlushOrCompaction() call
Summary:
If there's manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this cause too frequent MaybeScheduleFlushOrCompaction().

https://github.com/facebook/rocksdb/issues/3198
Closes https://github.com/facebook/rocksdb/pull/3366

Differential Revision: D6729575

Pulled By: ajkr

fbshipit-source-id: 96da04f8fd33297b1ccaec3badd9090403da29b0
2018-01-16 13:12:12 -08:00
Anand Ananthabhotla
d0f1b49ab6 Add a Close() method to DB to return status when closing a db
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348

Differential Revision: D6698158

Pulled By: anand1976

fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
2018-01-16 11:08:57 -08:00
Andrew Kryczka
43549c7d59 Prevent unnecessary calls to PurgeObsoleteFiles
Summary:
Split `JobContext::HaveSomethingToDelete` into two functions: itself and `JobContext::HaveSomethingToClean`. Now we won't call `DBImpl::PurgeObsoleteFiles` in cases where we really just need to call `JobContext::Clean`. The change is needed because I want to track pending calls to `PurgeObsoleteFiles` for a bug fix, which is much simpler if we only call it after `FindObsoleteFiles` finds files to delete.
Closes https://github.com/facebook/rocksdb/pull/3350

Differential Revision: D6690609

Pulled By: ajkr

fbshipit-source-id: 61502e7469288afe16a663a1b7df345baeaf246f
2018-01-12 13:27:08 -08:00
Andrew Kryczka
ba295cda29 replace DBTest.HugeNumbersOfLevel with a more targeted test case
Summary:
This test often causes out-of-space error when run on travis. We don't want such stress tests in our unit test suite.

The bug in #596, which this test intends to expose, can be repro'd as long as the bottommost level(s) are empty when CompactRange is called. I rewrote the test to cover this simple case without writing a lot of data.
Closes https://github.com/facebook/rocksdb/pull/3362

Differential Revision: D6710417

Pulled By: ajkr

fbshipit-source-id: 9a1ec85e738c813ac2fee29f1d5302065ecb54c5
2018-01-12 11:12:09 -08:00
Changli Gao
0a7ba0e548 Fix memleak when DB::DeleteFile()
Summary:
Because the corresponding read_first_record_cache_ item wasn't
erased, memory leaked.
Closes https://github.com/facebook/rocksdb/pull/1712

Differential Revision: D4363654

Pulled By: ajkr

fbshipit-source-id: 7da1adcfc8c380e4ffe05b8769fc2221ad17a225
2018-01-11 18:57:33 -08:00
Bo Liu
204af1eccc add WriteBatch::WriteBatch(std::string&&)
Summary:
to save a string copy for some use cases.

The change is pretty straightforward, please feel free to let me know if you want to suggest any tests for it.
Closes https://github.com/facebook/rocksdb/pull/3349

Differential Revision: D6706828

Pulled By: yiwu-arbug

fbshipit-source-id: 873ce4442937bdc030b395c7f99228eda7f59eb7
2018-01-11 15:43:56 -08:00
Andrew Kryczka
0c6e8be9e2 Fix directory name for db_basic_test
Summary:
It was using the same directory as `db_options_test` so transiently failed when unit tests were run in parallel.
Closes https://github.com/facebook/rocksdb/pull/3352

Differential Revision: D6691649

Pulled By: ajkr

fbshipit-source-id: bee433484fec4faedd5cadf2db3c92fdcc99a170
2018-01-10 15:41:46 -08:00
Siying Dong
6aa95f4d0f Fix a wrong log formatting
Summary:
I experienced weird segfault because of this mismatch of type in log formatting. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3345

Differential Revision: D6687224

Pulled By: siying

fbshipit-source-id: c51fb1c008b7ebc3efdc353a4adad3e8f5b3e9de
2018-01-09 14:58:33 -08:00
Andrew Kryczka
0f0d2ab95a fix DBImpl instance variable naming
Summary:
got confused while reading `FindObsoleteFiles` due to thinking it's a local variable, so renamed it properly
Closes https://github.com/facebook/rocksdb/pull/3342

Differential Revision: D6684797

Pulled By: ajkr

fbshipit-source-id: a4df0aae1cccce99d4dd4d164aadc85b17707132
2018-01-09 12:56:58 -08:00
Chris Lu
24e2c1640d add support for allow_ingest_behind in C API
Summary:
https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files

Need to expose these functions in the C API to be used by Go bindings.
Closes https://github.com/facebook/rocksdb/pull/3011

Differential Revision: D6679563

Pulled By: sagar0

fbshipit-source-id: 536f844ddaeb0172c6d7e416d2a75e8f9e57c8ef
2018-01-08 17:26:31 -08:00
Andrew Kryczka
f00e176c5b fix ForwardIterator reference to temporary object
Summary:
Fixes the following ASAN error:

```
==2108042==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fc50ae9b868 at pc 0x7fc5112aff55 bp 0x7fff9eb9dc10 sp 0x7fff9eb9dc08
=== How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
READ of size 8 at 0x7fc50ae9b868 thread T0
SCARINESS: 23 (8-byte-read-stack-use-after-scope)
     #0 rocksdb/dbformat.h:164                   rocksdb::InternalKeyComparator::user_comparator() const
     #1 librocksdb_src_rocksdb_lib.so+0x1429a7d  rocksdb::RangeDelAggregator::InitRep(std::vector<...> const&)
     #2 librocksdb_src_rocksdb_lib.so+0x142ceae  rocksdb::RangeDelAggregator::AddTombstones(std::unique_ptr<...>)
     #3 librocksdb_src_rocksdb_lib.so+0x1382d88  rocksdb::ForwardIterator::RebuildIterators(bool)
     #4 librocksdb_src_rocksdb_lib.so+0x1382362  rocksdb::ForwardIterator::ForwardIterator(rocksdb::DBImpl*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, rocksdb::SuperVersion*)
     #5 librocksdb_src_rocksdb_lib.so+0x11f433f  rocksdb::DBImpl::NewIterator(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*)
     #6 rocksdb/src/include/rocksdb/db.h:382     rocksdb::DB::NewIterator(rocksdb::ReadOptions const&)
     #7 rocksdb/db_range_del_test.cc:807         rocksdb::DBRangeDelTest_TailingIteratorRangeTombstoneUnsupported_Test::TestBody()
    #18 rocksdb/db_range_del_test.cc:1006        main

Address 0x7fc50ae9b868 is located in stack of thread T0 at offset 104 in frame
     #0 librocksdb_src_rocksdb_lib.so+0x13825af  rocksdb::ForwardIterator::RebuildIterators(bool)
```
Closes https://github.com/facebook/rocksdb/pull/3300

Differential Revision: D6612989

Pulled By: ajkr

fbshipit-source-id: e7ea2ed914c1b80a8a29d71d92440a6bd9cbcc80
2017-12-20 16:12:04 -08:00
Maysam Yabandeh
0ef3fdd732 Disable need_log_sync on bg err
Summary:
When there is a background error PreprocessWrite returns without marking the logs synced. If we keep need_log_sync to true, it would try to sync them at the end, which would break the logic. The patch would unset need_log_sync if the logs end up not being marked for sync in PreprocessWrite.
Closes https://github.com/facebook/rocksdb/pull/3293

Differential Revision: D6602347

Pulled By: maysamyabandeh

fbshipit-source-id: 37ee04209e8dcfd78de891654ce50d0954abeb38
2017-12-20 08:12:24 -08:00
Yi Wu
06149429d9 WritePrepared Txn: Return NotSupported on iterator refresh
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290

Differential Revision: D6599931

Pulled By: yiwu-arbug

fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
2017-12-18 22:29:30 -08:00
Maysam Yabandeh
78c2eedb4f fix release order in validateNumberOfEntries
Summary:
ScopedArenaIterator should be defined after range_del_agg so that it destructs the assigned iterator, which depends on range_del_agg, before it range_del_agg is already destructed.
Closes https://github.com/facebook/rocksdb/pull/3281

Differential Revision: D6592332

Pulled By: maysamyabandeh

fbshipit-source-id: 89a15d8ed13d0fc856b0c47dce3d91778738dbac
2017-12-18 14:27:28 -08:00
Anand Ananthabhotla
fccc12f386 Add a histogram stat for memtable flush
Summary:
Add a new histogram stat called rocksdb.db.flush.micros for memtable
flush
Closes https://github.com/facebook/rocksdb/pull/3269

Differential Revision: D6559496

Pulled By: anand1976

fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d
2017-12-15 18:57:00 -08:00
Yi Wu
237b292515 BlobDB: Remove the need to get sequence number per write
Summary:
Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence).
Closes https://github.com/facebook/rocksdb/pull/3274

Differential Revision: D6571497

Pulled By: yiwu-arbug

fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
2017-12-15 13:27:30 -08:00
Andrew Kryczka
5a7e08468a fix ThreadStatus for bottom-pri compaction threads
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270

Differential Revision: D6559428

Pulled By: ajkr

fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
2017-12-14 14:57:49 -08:00
Siying Dong
def6a00740 Print out compression type of new SST files in logging
Summary: Closes https://github.com/facebook/rocksdb/pull/3264

Differential Revision: D6552768

Pulled By: siying

fbshipit-source-id: 6303110aff22f341d5cff41f8d2d4f138a53652d
2017-12-14 10:27:43 -08:00
Maysam Yabandeh
546a63272f disableWAL with WriteImplWALOnly
Summary:
Currently WriteImplWALOnly simply returns when disableWAL is set. This is an incorrect behavior since it does not allocated the sequence number, which is a side-effect of writing to the WAL. This patch fixes the issue.
Closes https://github.com/facebook/rocksdb/pull/3262

Differential Revision: D6550974

Pulled By: maysamyabandeh

fbshipit-source-id: 745a83ae8f04e7ca6c8ffb247d6ef16c287c52e7
2017-12-13 07:57:44 -08:00
Zhongyi Xie
51c2ea0feb Reduce heavy hitter for Get operation
Summary:
This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get`

- rocksdb.block.cache.bytes.write
- rocksdb.block.cache.add
- rocksdb.block.cache.data.miss
- rocksdb.block.cache.data.bytes.insert
- rocksdb.block.cache.data.add
- rocksdb.block.cache.hit
- rocksdb.block.cache.data.hit
- rocksdb.block.cache.bytes.read

The db_bench statistics before and after the change are:

|1GB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:     |4.22%     |1.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.51%     |0.21%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|     	     |0.14%     |0.14%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|

|1MB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:    |3.48%     |1.08%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.80%     |0.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|    	     |0.35%     |0.35%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|
Closes https://github.com/facebook/rocksdb/pull/3172

Differential Revision: D6330532

Pulled By: miasantreble

fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
2017-12-12 21:11:33 -08:00
Islam AbdelRahman
9089373a01 Fix DeleteScheduler::MarkAsTrash() handling existing trash
Summary:
DeleteScheduler::MarkAsTrash() don't handle existing .trash files correctly
This cause rocksdb to not being able to delete existing .trash files on restart
Closes https://github.com/facebook/rocksdb/pull/3261

Differential Revision: D6548003

Pulled By: IslamAbdelRahman

fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6
2017-12-12 18:17:13 -08:00
Yi Wu
e3a06f12d2 WritePrepared Txn: fix compaction filter snapshot checks
Summary:
Add snapshot_checker check whenever we need to check sequence against snapshots and decide what to do with an input key. The changes are related to one of:
* compaction filter
* single delete
* delete at bottom level
* merge
Closes https://github.com/facebook/rocksdb/pull/3251

Differential Revision: D6537850

Pulled By: yiwu-arbug

fbshipit-source-id: 3faba40ed5e37779f4a0cb7ae78af9546659c7f2
2017-12-12 11:12:24 -08:00
Zhongyi Xie
bb5ed4b1d1 exclude DynamicUniversalCompactionOptions from ROCKSDB_LITE
Summary:
since [SetOptions](https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L494) is not supported in ROCKSDB_LITE
Right now unit test under lite is broken
Closes https://github.com/facebook/rocksdb/pull/3253

Differential Revision: D6539428

Pulled By: miasantreble

fbshipit-source-id: 13172b8ecbd75682330726498ea198969bc3e637
2017-12-11 16:28:20 -08:00
Yi Wu
9a27ac5d89 Fix drop column family data race
Summary:
A data race is caught by tsan_crash test between compaction and DropColumnFamily:
https://gist.github.com/yiwu-arbug/5a2b4baae05eeb99ae1719b650f30a44 Compaction checks if the column family has been dropped on each key input, while user can issue DropColumnFamily which updates cfd->dropped_, causing the data race. Fixing it by making cfd->dropped_ an atomic.
Closes https://github.com/facebook/rocksdb/pull/3250

Differential Revision: D6535991

Pulled By: yiwu-arbug

fbshipit-source-id: 5571df020beae7fa7db6fff5ad0d598f49962895
2017-12-11 13:57:48 -08:00
Zhongyi Xie
fcc8a6574d Make Universal compaction options dynamic
Summary:
Let me know if more test coverage is needed
Closes https://github.com/facebook/rocksdb/pull/3213

Differential Revision: D6457165

Pulled By: miasantreble

fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9
2017-12-11 13:27:06 -08:00
Prashant D
6a183d1ae8 Fix coverity issues compaction_job, compaction_picker
Summary:
db/compaction_job.cc:
  ReportStartedCompaction(compaction);

CID 1419863 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member bottommost_level_ is not initialized in this constructor nor in any functions that it calls.

db/compaction_picker_universal.cc:
7struct InputFileInfo {
   	2. uninit_member: Non-static class member level is not initialized in this constructor nor in any functions that it calls.

CID 1405355 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member index is not initialized in this constructor nor in any functions that it calls.
 38  InputFileInfo() : f(nullptr) {}

db/dbformat.h:
 ParsedInternalKey()
 84      : sequence(kMaxSequenceNumber)  // Make code analyzer happy

CID 1168095 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member type is not initialized in this constructor nor in any functions that it calls.
 85  {}  // Intentionally left uninitialized (for speed)
Closes https://github.com/facebook/rocksdb/pull/3091

Differential Revision: D6534558

Pulled By: yiwu-arbug

fbshipit-source-id: 5ada975956196d267b3f149386842af71eda7553
2017-12-11 11:57:15 -08:00
Prashant D
34aa245dd8 Fix coverity issues version, write_batch
Summary:
db/version_builder.cc:
117        base_vstorage_->InternalComparator();

CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.

db/version_edit.h:
145  FdWithKeyRange()
146      : fd(),
147        smallest_key(),
148        largest_key() {

CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
149  }

db/version_set.cc:
120    }

CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
121  }

db/write_batch.cc:
 939    assert(cf_mems_);

CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
 940  }
Closes https://github.com/facebook/rocksdb/pull/3092

Differential Revision: D6505666

Pulled By: yiwu-arbug

fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
2017-12-07 11:57:36 -08:00
Andrew Kryczka
2e3a00987e fix ASAN for DeleteFilesInRange test case
Summary:
error message was

```
==3095==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd18216c40 at pc 0x0000005edda1 bp 0x7ffd18215550 sp 0x7ffd18214d00
...
Address 0x7ffd18216c40 is located in stack of thread T0 at offset 1952 in frame
     #0 internal_repo_rocksdb/db_compaction_test.cc:1520 rocksdb::DBCompactionTest_DeleteFileRangeFileEndpointsOverlapBug_Test::TestBody()
```

It was unsafe to have slices referring to the temporary string objects' buffers, as those strings were destroyed before the slices were used. Fixed it by assigning the strings returned by `Key()` to local variables.
Closes https://github.com/facebook/rocksdb/pull/3238

Differential Revision: D6507864

Pulled By: ajkr

fbshipit-source-id: dd07de1a0070c6748c1ab4f3d7bd31f9a81889d0
2017-12-07 11:12:43 -08:00
Sagar Vemuri
bbef8c3884 Log GetCurrentTime failures during Flush and Compaction
Summary:
`GetCurrentTime()` is used to populate `creation_time` table property during flushes and compactions. It is safe to ignore `GetCurrentTime()` failures here but they should be logged.

(Note that `creation_time` property was introduced as part of TTL-based FIFO compaction in #2480.)

Tes Plan:
`make check`
Closes https://github.com/facebook/rocksdb/pull/3231

Differential Revision: D6501935

Pulled By: sagar0

fbshipit-source-id: 376adcf4ab801d3a43ec4453894b9a10909c8eb6
2017-12-06 20:56:53 -08:00
Andrew Kryczka
78d1a5ec72 Preserve overlapping file endpoint invariant
Summary:
Fix for #2833.

- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843

Differential Revision: D5772387

Pulled By: ajkr

fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
2017-12-06 18:56:54 -08:00
Yi Wu
a7d32776f0 Fix write_callback_test compile error
Summary:
Rename shadow variable name db_impl.

Fixing #3227
Closes https://github.com/facebook/rocksdb/pull/3235

Differential Revision: D6504051

Pulled By: yiwu-arbug

fbshipit-source-id: 186c9378dabb11f8d6db56f45c95cc3b029fcb88
2017-12-06 17:12:27 -08:00
Yi Wu
20995c5729 Make iterator invalid on Merge error
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226

Differential Revision: D6499094

Pulled By: yiwu-arbug

fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c
2017-12-06 11:56:39 -08:00
Andrew Kryczka
63f1c0a57d fix gflags namespace
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212

Differential Revision: D6456973

Pulled By: ajkr

fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
2017-12-01 10:42:05 -08:00
Maysam Yabandeh
18dcf7f98d WritePrepared Txn: PreReleaseCallback
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205

Differential Revision: D6438959

Pulled By: maysamyabandeh

fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
2017-11-30 23:50:45 -08:00
zhangjinpeng1987
ffacaaa3ea fix Seek with lower_bound
Summary:
When Seek a key less than `lower_bound`, should return `lower_bound`.
ajkr PTAL
Closes https://github.com/facebook/rocksdb/pull/3199

Differential Revision: D6421126

Pulled By: ajkr

fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7
2017-11-29 22:56:29 -08:00
kapitan-k
75d57a5d53 C API: Add some block based table options
Summary: Closes https://github.com/facebook/rocksdb/pull/3159

Differential Revision: D6428220

Pulled By: sagar0

fbshipit-source-id: 60508d09b5281f54b907a1c40e9631fc08343131
2017-11-28 14:12:44 -08:00
Yi Wu
3cf562be31 Fix IOError on WAL write doesn't propagate to write group follower
Summary:
This is a simpler version of #3097 by removing all unrelated changes.

Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
Closes https://github.com/facebook/rocksdb/pull/3201

Differential Revision: D6421644

Pulled By: yiwu-arbug

fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
2017-11-28 11:42:48 -08:00
Andrew Kryczka
1bdb44de95 optimize file ingestion checks for range deletion overlap
Summary:
Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179

Differential Revision: D6358125

Pulled By: ajkr

fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
2017-11-28 11:27:02 -08:00
Griffin Smith
2f09524762 Expose all remaining read and write options via the C API
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185

Differential Revision: D6389658

Pulled By: sagar0

fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
2017-11-28 10:28:46 -08:00
Maysam Yabandeh
e59cb2a19b Add seq_per_batch to WriteWithCallbackTest
Summary:
Augment WriteWithCallbackTest to also test when seq_per_batch is true.
Closes https://github.com/facebook/rocksdb/pull/3195

Differential Revision: D6398143

Pulled By: maysamyabandeh

fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3
2017-11-22 13:56:44 -08:00
Zhongyi Xie
5fac4729cc make compaction_readahead_size_ thread safe
Summary:
this should fix the failing tsan_check
Closes https://github.com/facebook/rocksdb/pull/3192

Differential Revision: D6390004

Pulled By: miasantreble

fbshipit-source-id: 6cadfc6f68febb1a77b0abcdb5416570dad926a5
2017-11-21 20:11:38 -08:00
anand1976
d394a6bb48 Add a ticker stat for number of keys skipped during iteration
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.

Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177

Differential Revision: D6353897

Pulled By: anand1976

fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
2017-11-20 21:26:37 -08:00
Gustav Davidsson
2d04ed65e4 Make trash-to-DB size ratio limit configurable
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158

Differential Revision: D6304897

Pulled By: gdavidsson

fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
2017-11-17 11:58:17 -08:00
Zhongyi Xie
32e31d49d1 Make DBOption compaction_readahead_size dynamic
Summary: Closes https://github.com/facebook/rocksdb/pull/3004

Differential Revision: D6056141

Pulled By: miasantreble

fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f
2017-11-16 17:57:25 -08:00
Maysam Yabandeh
54b43563be WritePrepared Txn: Refactoring WriteCallback
Summary:
Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number and making sure it is consistent across the code.
Closes https://github.com/facebook/rocksdb/pull/3168

Differential Revision: D6324312

Pulled By: maysamyabandeh

fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33
2017-11-15 08:27:06 -08:00
Maysam Yabandeh
53863b76f9 WritePrepared Txn: fix bug with Rollback seq
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157

Differential Revision: D6304291

Pulled By: maysamyabandeh

fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
2017-11-15 08:27:06 -08:00
Maysam Yabandeh
175d5d6a9e Properly destruct rebuilding_trx_
Summary:
When testing rebuilding_trx_ in MemTableInserter might still be set before the tests finishes which would cause ASAN alarms for leaks. This patch deletes the pointers in MemTableInserter destructor.
Closes https://github.com/facebook/rocksdb/pull/3162

Differential Revision: D6317113

Pulled By: maysamyabandeh

fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21
2017-11-14 08:56:50 -08:00
Maysam Yabandeh
2515266725 WritePrepared Txn: Refactoring TrackKeys
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140

Differential Revision: D6290258

Pulled By: maysamyabandeh

fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
2017-11-11 13:14:20 -08:00
Maysam Yabandeh
2edc92bc28 WritePrepared Txn: cross-compatibility test
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118

Differential Revision: D6227247

Pulled By: maysamyabandeh

fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
2017-11-11 11:28:37 -08:00
Maysam Yabandeh
857adf388f WritePrepared Txn: Refactor conf params
Summary:
Summary of changes:
- Move seq_per_batch out of Options
- Rename concurrent_prepare to two_write_queues
- Add allocate_seq_only_for_data_
Closes https://github.com/facebook/rocksdb/pull/3136

Differential Revision: D6304458

Pulled By: maysamyabandeh

fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c
2017-11-10 17:28:12 -08:00
Dmitri Smirnov
f8e2db0717 Fix crashes, address test issues and adjust windows test script
Summary:
Add per-exe execution capability
  Add fix parsing of groups/tests
  Add timer test exclusion

 Fix unit tests
  Ifdef threadpool specific tests that do not pass on Vista threadpool.
  Remove spurious outout from prefix_test so test case listing works
  properly.
  Fix not using standard test directories results in file creation errors
  in sst_dump_test.

  BlobDb fixes:
    In C++ end() iterators can not be dereferenced. They are not valid.
	When deleting blob_db_ set it to nullptr before any other code executes.
	Not fixed:. On Windows you can not delete a file while it is open.
	[ RUN      ] BlobDBTest.ReadWhileGC
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied

  write_batch
    Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152

Differential Revision: D6293274

Pulled By: sagar0

fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
2017-11-10 10:41:57 -08:00
Shaohua Li
eefd75a228 Stream
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095

Differential Revision: D6298030

Pulled By: shligit

fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
2017-11-10 09:26:24 -08:00
kapitan-k
f1c5eaba56 updated c ingestexternalfileoptions for ingest behind
Summary: Closes https://github.com/facebook/rocksdb/pull/3151

Differential Revision: D6293861

Pulled By: ajkr

fbshipit-source-id: f8db0a71509d1cd8237f2d377bf9e1bb0464bdbf
2017-11-09 18:15:09 -08:00
Andrew Kryczka
93f69cb93a use bottommost compression when base level is bottommost
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:

- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level

I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141

Differential Revision: D6264614

Pulled By: ajkr

fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
2017-11-09 17:42:00 -08:00
Sagar Vemuri
a6d8e30c05 Remove unnecessary status check in TableCache::NewIterator
Summary:
While investigating the usage of `new_table_iterator_nanos` perf counter, I saw some code was wrapper around with unnecessary status check ... so removed it.
Closes https://github.com/facebook/rocksdb/pull/3120

Differential Revision: D6229181

Pulled By: sagar0

fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34
2017-11-03 14:42:08 -07:00
Andrew Kryczka
24ad430600 pass key/value samples through zstd compression dictionary generator
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.

Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057

Differential Revision: D6116891

Pulled By: ajkr

fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
2017-11-02 22:56:36 -07:00
Andrew Kryczka
c4c1f961e7 dynamically change current memtable size
Summary:
Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.

This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
Closes https://github.com/facebook/rocksdb/pull/3119

Differential Revision: D6228304

Pulled By: ajkr

fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
2017-11-02 22:28:10 -07:00
Yi Wu
62578d80c1 Blob DB: Add compaction filter to remove expired blob index entries
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090

Differential Revision: D6183812

Pulled By: yiwu-arbug

fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
2017-11-02 17:27:38 -07:00
Yi Wu
7bfa88037e Blob DB: fix snapshot handling
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087

Differential Revision: D6182519

Pulled By: yiwu-arbug

fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
2017-11-02 15:58:27 -07:00
Andrew Kryczka
6778690b51 fix duplicate definition of GetEntryType()
Summary:
It's also defined in db/dbformat.cc per 7fe3b32896
Closes https://github.com/facebook/rocksdb/pull/3111

Differential Revision: D6219140

Pulled By: ajkr

fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35
2017-11-01 22:56:17 -07:00
Maysam Yabandeh
02693f64fc WritePrepared Txn: ValidateSnapshot
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101

Differential Revision: D6199405

Pulled By: maysamyabandeh

fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
2017-11-01 19:11:09 -07:00
Mikhail Antonov
7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Maysam Yabandeh
17731a43a6 WritePrepared Txn: Optimize for recoverable state
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071

Differential Revision: D6148023

Pulled By: maysamyabandeh

fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
2017-11-01 17:26:46 -07:00
Shaohua Li
33c7d4ccd9 Make writable_file_max_buffer_size dynamic
Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681
2017-10-31 13:56:35 -07:00
Andrew Kryczka
b7bc9cc038 fix tracking oldest snapshot for bottom-level compaction
Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080

Differential Revision: D6185257

Pulled By: ajkr

fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd
2017-10-30 00:55:58 -07:00
Yi Wu
792ef10ca8 Return Status::InvalidArgument if user request sync write while disabling WAL
Summary:
write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options.
Closes https://github.com/facebook/rocksdb/pull/3086

Differential Revision: D6176822

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071
2017-10-28 22:11:18 -07:00
Andrew Kryczka
6a9335dbbb always drop tombstones compacted to bottommost level
Summary:
Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels.
Closes https://github.com/facebook/rocksdb/pull/3085

Differential Revision: D6175742

Pulled By: ajkr

fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b
2017-10-27 15:56:35 -07:00
Yi Wu
84a04af9a9 TableProperty::oldest_key_time defaults to 0
Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.

Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079

Differential Revision: D6165702

Pulled By: yiwu-arbug

fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
2017-10-27 15:00:05 -07:00
Islam AbdelRahman
05993155ef Mark files as trash by using .trash extension
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970

Differential Revision: D5976805

Pulled By: IslamAbdelRahman

fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
2017-10-27 13:27:12 -07:00
Prashant D
50e95a63dd Fix coverity issues column_family, compaction_db/iterator
Summary:
db/column_family.h :
79  ColumnFamilyHandleInternal()

CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
 80      : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}

db/compacted_db_impl.cc:
 18CompactedDBImpl::CompactedDBImpl(
 19  const DBOptions& options, const std::string& dbname)
 20  : DBImpl(options, dbname) {
   	2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
   	4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.

CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
 21}

db/compaction_iterator.cc:
9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.

CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
Closes https://github.com/facebook/rocksdb/pull/3084

Differential Revision: D6172999

Pulled By: sagar0

fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460
2017-10-27 11:26:42 -07:00
Prashant D
47166baeac Fix coverity uninitialized fields warnings
Pulled By: ajkr

Differential Revision: D6170448

fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c
2017-10-26 21:11:50 -07:00
Andrew Kryczka
95667383db implement lower bound for iterators
Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074

Differential Revision: D6158654

Pulled By: ajkr

fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
2017-10-26 17:27:42 -07:00
Yi Wu
5a2a6483dc Blob DB: Inline small values in base DB
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.

Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.

See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066

Differential Revision: D6142115

Pulled By: yiwu-arbug

fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
2017-10-26 12:30:54 -07:00
Andrew Kryczka
9b18cc2363 single-file bottom-level compaction when snapshot released
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
2017-10-25 16:30:37 -07:00
Islam AbdelRahman
addfe1ef4b Fix tombstone scans in SeekForPrev outside prefix
Summary:
When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true

Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev()
Closes https://github.com/facebook/rocksdb/pull/3067

Differential Revision: D6149638

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18
2017-10-25 15:12:00 -07:00
zach shipko
386a57e6ef Fix build on OpenBSD
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061

Differential Revision: D6138800

Pulled By: ajkr

fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
2017-10-24 13:27:38 -07:00
Yi Wu
66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Dmitri Smirnov
d2a65c59e1 Fix unused var warnings in Release mode
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
2017-10-23 14:27:04 -07:00
Maysam Yabandeh
63822eb761 Enable two write queues for transactions
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046

Differential Revision: D6106534

Pulled By: maysamyabandeh

fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
2017-10-23 14:27:04 -07:00
Sagar Vemuri
a02ed12638 Exclude DBTest.DynamicFIFOCompactionOptions test under RocksDB Lite
Summary:
This test shouldn't be enabled under the lite version; and this fixes the failing contrun test due to #3006.
Closes https://github.com/facebook/rocksdb/pull/3056

Differential Revision: D6114681

Pulled By: sagar0

fbshipit-source-id: dc5243549ae6b1353cec7edb820c771d95f66dda
2017-10-20 17:11:39 -07:00
Maysam Yabandeh
3ef55d2c7c Split CompactionFilterWithValueChange
Summary:
The test currently times out when it is run under tsan. This patch split it into 4 tests.
Closes https://github.com/facebook/rocksdb/pull/3047

Differential Revision: D6106515

Pulled By: maysamyabandeh

fbshipit-source-id: 03a28cdf8b1c097be2361b1b0cc3dc1acf2b5d63
2017-10-20 15:42:07 -07:00
Andrew Kryczka
f8b5bb2fd8 remove unused code
Summary:
fixup 6a541afcc4. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
Closes https://github.com/facebook/rocksdb/pull/3055

Differential Revision: D6114132

Pulled By: ajkr

fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c
2017-10-20 14:11:52 -07:00
Sagar Vemuri
f0804db7f7 Make FIFO compaction options dynamically configurable
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.

Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`

Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006

Differential Revision: D6058619

Pulled By: sagar0

fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
2017-10-19 15:26:36 -07:00
Dmitri Smirnov
ebab2e2d42 Enable MSVC W4 with a few exceptions. Fix warnings and bugs
Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
2017-10-19 10:57:12 -07:00
Gihwan Oh
7deed2b43c Fix a typo in a comment
Summary:
instad of for specific level -> instead of a specific level
Closes https://github.com/facebook/rocksdb/pull/3040

Differential Revision: D6090811

Pulled By: sagar0

fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c
2017-10-18 12:32:28 -07:00
Maysam Yabandeh
7e38238981 WritePrepared Txn: Disable GC during recovery
Summary:
Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values.
Closes https://github.com/facebook/rocksdb/pull/2980

Differential Revision: D6000191

Pulled By: maysamyabandeh

fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675
2017-10-18 09:11:50 -07:00
Nikhil Benesch
7891af8b53 expose a hook to skip tables during iteration
Summary:
As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.

This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.

We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in  from hours to seconds.

FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a`std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.

```diff
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1546,6 +1546,13 @@ struct ReadOptions {
   // Default: false
   bool ignore_range_deletions;

+  // A callback to determine whether relevant keys for this scan exist in a
+  // given table based on the table's properties. The callback is passed the
+  // properties of each table during iteration. If the callback returns false,
+  // the table will not be scanned.
+  // Default: empty (every table will be scanned)
+  std::function<bool(const TableProperties&)> table_filter;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

/cc danhhz
Closes https://github.com/facebook/rocksdb/pull/2265

Differential Revision: D5054262

Pulled By: yiwu-arbug

fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476
2017-10-17 22:12:00 -07:00
Yi Wu
eaaef91178 Blob DB: Store blob index as kTypeBlobIndex in base db
Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.

Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000

Differential Revision: D6050044

Pulled By: yiwu-arbug

fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
2017-10-17 17:28:11 -07:00
zhangjinpeng1987
966b32b57c fix delete range bug
Summary:
Fix this [issue](https://github.com/facebook/rocksdb/issues/2989).
ajkr PTAL

Close #2989
Closes https://github.com/facebook/rocksdb/pull/3017

Differential Revision: D6078541

Pulled By: yiwu-arbug

fbshipit-source-id: ef3db87b37b9156f83ca468aa39dea1f6dbde49d
2017-10-17 11:13:19 -07:00
Nikhil Benesch
c0208dffbe arena: derive alignment unit from std::max_align_t
Summary:
As raised in #2265, the arena allocator will return memory that is improperly aligned to store a `std::function` on macOS. Oddly, I'm unable to tickle this bug without adding a `std::function` field to `struct ReadOptions`—but my proposal in #2265 does exactly that.

In any case, here's a simple reproduction. Apply this bogus patch to get a `std::function` into `struct ReadOptions`

```
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1035,6 +1035,8 @@ struct ReadOptions {
   // Default: 0
   uint64_t max_skippable_internal_keys;

+  std::function<void()> foo;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

then compile `db_properties_test` *with ubsan* and run `ReadLatencyHistogramByLevel`:

```
$ make COMPILE_WITH_UBSAN=1 db_properties_test
$ ./db_properties_test --gtest_filter=DBPropertiesTest.ReadLatencyHistogramByLevel
```

ubsan will complain about several misaligned accesses:

```
Note: Google Test filter = DBPropertiesTest.ReadLatencyHistogramByLevel
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBPropertiesTest
[ RUN      ] DBPropertiesTest.ReadLatencyHistogramByLevel
util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment
0x00010d85516c: note: pointer points here
  01 00 34 57 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  78 24 82 0a 01 00 00 00
              ^
util/coding.h:362:3: runtime error: store to misaligned address 0x7fff5733fac4 for type 'unsigned long', which requires 8 byte alignment
0x7fff5733fac4: note: pointer points here
  01 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  80 1d 96 0d 01 00 00 00
              ^
util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment
0x00010d85516c: note: pointer points here
  01 00 34 57 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  78 24 82 0a 01 00 00 00
              ^
version_set.cc:854: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:512: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:505: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
functional:1583: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: store to misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:864:29: runtime error: upcast of misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:521:12: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:521:12: runtime error: load of misaligned address 0x00010dbfa5d8 for type 'rocksdb::TableCache *', which requires 16 byte alignment
0x00010dbfa5d8: note: pointer points here
 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00
              ^
db/version_set.cc:522:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:9: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:24: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:38: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:57: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:57: runtime error: load of misaligned address 0x00010dbfa678 for type 'rocksdb::RangeDelAggregator *', which requires 16 byte alignment
0x00010dbfa678: note: pointer points here
 01 00 00 00  d0 a1 bf 0d 01 00 00 00  00 00 00 00 00 00 00 00  f8 db 70 0a 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:523:54: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:523:54: runtime error: load of misaligned address 0x00010dbfa668 for type 'rocksdb::HistogramImpl *', which requires 16 byte alignment
0x00010dbfa668: note: pointer points here
 01 00 00 00  c8 88 a5 0d 01 00 00 00  00 00 00 00 01 00 00 00  d0 a1 bf 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:47: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:62: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/table_cache.cc:228:33: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
table/block_based_table_reader.cc:1554:41: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
table/block_based_table_reader.cc:1396:21: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
include/rocksdb/options.h:931:8: runtime error: reference binding to misaligned address 0x00010dbfa628 for type 'const std::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1584:13: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *const' (aka '__base<void ()> *const'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
table/block_based_table_reader.cc:1555:24: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/table_cache.cc:244:54: runtime error: load of misaligned address 0x00010dbfa618 for type 'const bool', which requires 16 byte alignment
0x00010dbfa618: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/table_cache.cc:246:49: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:26: runtime error: load of misaligned address 0x00010dbfa5f8 for type 'const rocksdb::Slice *const', which requires 16 byte alignment
0x00010dbfa5f8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
functional:1765: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:27: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
[       OK ] DBPropertiesTest.ReadLatencyHistogramByLevel (1599 ms)
[----------] 1 test from DBPropertiesTest (1599 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1599 ms total)
[  PASSED  ] 1 test.
```

So it seems the root cause is that the internal implementation of `std::function` on macOS (and perhaps with libc++ generally?) requires 16-byte aligned memory, but the arena allocator only guarantees that the returned memory will be `sizeof(void*)` aligned, which is only 8-byte alignment on my machine. This patch solves the problem by adjusting the allocator to derive the necessary alignment from `alignof(std::max_align_t)`, which is properly 16 bytes on my machine.

As I mentioned in #2265, none of RocksDB's tests will cause this unaligned access to actually abort the process, but, on macOS, linking CockroachDB against a version of RocksDB with the above patch and letting it run for just a few seconds will cause a SIGABRT.

```
Process 19792 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95
cockroach`DBNewIter:
->  0x4f5e78f <+95>:  callq  *0x28(%rax)
    0x4f5e792 <+98>:  jmp    0x4f5e79e                 ; <+110>
    0x4f5e794 <+100>: movq   -0x50(%rbp), %rcx
    0x4f5e798 <+104>: movq   %rax, %rdi
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95
```

I'd get you a backtrace, but [Go doesn't include cgo debug information on macOS](https://github.com/golang/go/issues/6942). I've also tried building against libc++ on Linux, where debug information would be available, but I can't seem to trigger the bug there.

In any case, this PR both fixes the segfault in CockroachDB and fixes the warnings reported by ubsan.
Closes https://github.com/facebook/rocksdb/pull/2347

Differential Revision: D5108596

Pulled By: yiwu-arbug

fbshipit-source-id: bd5e4323b2ce915ed4fe78e123cb8996aec75a00
2017-10-17 11:13:19 -07:00
Changli Gao
b8cea7cc27 VersionBuilder: Erase with iterators for better performance
Summary: Closes https://github.com/facebook/rocksdb/pull/3007

Differential Revision: D6077701

Pulled By: yiwu-arbug

fbshipit-source-id: a6fd5b8a23f4feb1660b9ce027f651a7e90352b3
2017-10-17 10:12:37 -07:00