Commit Graph

1533 Commits

Author SHA1 Message Date
Dmitri Smirnov
2a62ca1750 Make Optimistic Tx database stackable
Summary:
This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr.

Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db.
Closes https://github.com/facebook/rocksdb/pull/3566

Differential Revision: D7184502

Pulled By: yiwu-arbug

fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3
2018-04-03 15:28:40 -07:00
Sagar Vemuri
04c11b867d Level Compaction with TTL
Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.

This functionality can be controlled by the newly introduced column family option -- ttl.

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
2018-04-02 22:14:28 -07:00
Adam Retter
8917eee962 Fixed small typos
Summary: Closes https://github.com/facebook/rocksdb/pull/3667

Differential Revision: D7470060

Pulled By: miasantreble

fbshipit-source-id: 8e8545cda38f0805f35ccdb8841666a2d7a965f5
2018-04-01 17:14:46 -07:00
Sagar Vemuri
90c542347a Update comments about MergeOperator::AllowSingleOperand
Summary:
Updated comments around AllowSingleOperand.
Reason: A couple of users were confused and encountered issues due to no overriding PartialMerge with AllowSingleOperand=true.

I'll also look into modifying the default merge operator implementation so that overriding PartialMerge is not mandatory when AllowSingleOp=true.
Closes https://github.com/facebook/rocksdb/pull/3659

Differential Revision: D7422691

Pulled By: sagar0

fbshipit-source-id: 3d075a6ced0120f5d65cb7ae5412936f1862f342
2018-03-27 17:13:53 -07:00
Anand Ananthabhotla
f9f4d40f93 Align SST file data blocks to avoid spanning multiple pages
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502

Differential Revision: D7400897

Pulled By: anand1976

fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
2018-03-26 20:26:10 -07:00
Radoslaw Zarzynski
09b6bf828a InlineSkiplist: don't decode keys unnecessarily during comparisons
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.

Rough performance comparison
=====
Big keys, no compression.

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.222 micros/op 236836 ops/sec;   80.4 MB/s
```

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.064 micros/op 246059 ops/sec;   83.5 MB/s
```

TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516

Differential Revision: D7059300

Pulled By: ajkr

fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
2018-03-23 12:14:30 -07:00
Zhongyi Xie
1cbc96d236 FlushReason improvement
Summary:
Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full"

> $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
$ grep 'flush_reason' /dev/shm/dbbench/LOG
...
2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}

With the fix:
> 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
Closes https://github.com/facebook/rocksdb/pull/3627

Differential Revision: D7328772

Pulled By: miasantreble

fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
2018-03-22 18:42:18 -07:00
Siying Dong
118058ba69 SstFileManager: add bytes_max_delete_chunk
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640

Differential Revision: D7358679

Pulled By: siying

fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
2018-03-22 15:58:37 -07:00
Fosco Marotto
de6cf95a53 Update history for future 5.13 release
Summary: Closes https://github.com/facebook/rocksdb/pull/3631

Differential Revision: D7367519

Pulled By: gfosco

fbshipit-source-id: 57826cc1c9ffc9f2b351075567b8ad929809cb74
2018-03-22 14:59:27 -07:00
QingpingWang
2ce8f63f81 C API for PerfContext
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607

Differential Revision: D7294225

Pulled By: ajkr

fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
2018-03-21 22:13:34 -07:00
Niv Dayan
da82aab126 allowing CompactFiles to return new file names
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608

Differential Revision: D7275789

Pulled By: siying

fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
2018-03-15 11:58:12 -07:00
Sagar Vemuri
cc118b0e6f Update version
Summary:
We missed updating version.h on master when cutting 5.11.fb and 5.12.fb branches. It should be the same as the version in the latest release branch (or should it be one more?).

I noticed this when trying to run some upgrade/downgrade tests from 5.11 to some new code on master.
Closes https://github.com/facebook/rocksdb/pull/3611

Differential Revision: D7282917

Pulled By: sagar0

fbshipit-source-id: 205ee75b77c5b6bbcea95a272760b427025a4aba
2018-03-15 10:41:48 -07:00
Amy Tai
e476d0e252 Adding stat to count cancelled compactions
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574

Differential Revision: D7190259

Pulled By: amytai

fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
2018-03-08 10:42:28 -08:00
Bruce Mitchener
a3a3f5497c Fix some typos in comments and docs.
Summary: Closes https://github.com/facebook/rocksdb/pull/3568

Differential Revision: D7170953

Pulled By: siying

fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c
2018-03-08 10:27:25 -08:00
Sagar Vemuri
e69f6e8629 Fix API name in a comment in db.h
Summary:
... so that people are not confused.
Closes https://github.com/facebook/rocksdb/pull/3580

Differential Revision: D7187175

Pulled By: sagar0

fbshipit-source-id: bce70093d52e38cd24c9432fd708885d7c2c013e
2018-03-07 15:27:17 -08:00
Bruce Mitchener
0de710f5b8 Use nullptr instead of NULL / 0 more consistently.
Summary: Closes https://github.com/facebook/rocksdb/pull/3569

Differential Revision: D7170968

Pulled By: yiwu-arbug

fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348
2018-03-07 12:42:12 -08:00
Stuart
f021f1d9e1 Add rocksdb_open_with_ttl function in C API
Summary:
Change-Id: Ie6f9b10bce459f6bf0ade0e5877264b4e10da3f5
Signed-off-by: Stuart <Stuart.Hu@emc.com>
Closes https://github.com/facebook/rocksdb/pull/3553

Differential Revision: D7144833

Pulled By: sagar0

fbshipit-source-id: 815225fa6e560d8a5bc47ffd0a98118b107ce264
2018-03-06 20:57:20 -08:00
amytai
0a3db28d98 Disallow compactions if there isn't enough free space
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.

Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449

Differential Revision: D7016941

Pulled By: amytai

fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
2018-03-06 16:27:54 -08:00
Dmitri Smirnov
c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and aling unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoin tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Yi Wu
1209b6db5c Blob DB: remove existing garbage collection implementation
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.

CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551

Differential Revision: D7130190

Pulled By: yiwu-arbug

fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
2018-03-02 12:57:23 -08:00
Yi Wu
bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
Andrew Kryczka
3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Siying Dong
c287c098a4 Update comments in DB::Close()
Summary: Closes https://github.com/facebook/rocksdb/pull/3543

Differential Revision: D7093251

Pulled By: siying

fbshipit-source-id: 4066b82c95ecb65866c5842d68ab13ab9f85d567
2018-02-27 12:42:31 -08:00
Anand Ananthabhotla
dfbe52e099 Fix the Logger::Close() and DBImpl::Close() design pattern
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528

Reviewed By: gfosco

Differential Revision: D7049303

Pulled By: anand1976

fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
2018-02-23 13:57:26 -08:00
Maysam Yabandeh
90eca1e616 WritePrepared Txn: optimize SubBatchCnt
Summary:
Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected.
Closes https://github.com/facebook/rocksdb/pull/3529

Differential Revision: D7049947

Pulled By: maysamyabandeh

fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd
2018-02-22 18:12:26 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Sagar Vemuri
8ada876dfe Add rocksdb.iterator.internal-key property
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
2018-02-20 19:12:09 -08:00
Maysam Yabandeh
8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
Maysam Yabandeh
8a04ee4fd1 WritePrepared Txn: use TransactionDBWriteOptimizations (2nd attempt)
Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496

Differential Revision: D6971784

Pulled By: maysamyabandeh

fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
2018-02-12 16:43:40 -08:00
Andrew Kryczka
ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
jsteemann
d62af7fa4f fix a typo (of a potential vi user)
Summary: Closes https://github.com/facebook/rocksdb/pull/3481

Differential Revision: D6939089

Pulled By: siying

fbshipit-source-id: ccce3ae10cc5ff50a74b85804afd044b21a3c3e2
2018-02-09 12:58:07 -08:00
Zhongyi Xie
945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it,  logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Maysam Yabandeh
8feee28020 Add skip_cc option to TransactionDB::Write
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457

Differential Revision: D6877318

Pulled By: maysamyabandeh

fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
2018-02-06 15:28:24 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Anand Ananthabhotla
4b124fb9d3 Handle error return from WriteBuffer()
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404

Differential Revision: D6879954

Pulled By: anand1976

fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
2018-02-05 13:59:34 -08:00
Maysam Yabandeh
813719e952 WritePrepared Txn: Duplicate Keys, Memtable part
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418

Differential Revision: D6822412

Pulled By: maysamyabandeh

fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
2018-01-31 18:57:07 -08:00
Yi Wu
e62a763752 Blob DB: miscellaneous changes
Summary:
* Expose garbage collection related options
* Minor logging and counter name update
* Remove unused constants.
Closes https://github.com/facebook/rocksdb/pull/3451

Differential Revision: D6867077

Pulled By: yiwu-arbug

fbshipit-source-id: 6c3272a9c9d78b125a0bd6b2e56d00d087cdd6c8
2018-01-31 18:13:23 -08:00
Huachao Huang
ab43ff58b5 Delete files in multiple ranges at once
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.

This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.

Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431

Differential Revision: D6849228

Pulled By: ajkr

fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
2018-01-30 13:56:39 -08:00
Ben Darnell
65cd6cd4b6 Rewrite comments on use_fsync option
Summary:
This replaces a vague warning about the mostly-obsolete ext3 filesystem with
a more detailed note about a historical bug in the still-relevant ext4.

Fixes #3410
Closes https://github.com/facebook/rocksdb/pull/3421

Differential Revision: D6834881

Pulled By: siying

fbshipit-source-id: 7771ef5c89a54c0ac17821680779c48178d0b400
2018-01-29 09:57:46 -08:00
Yi Wu
439855a774 StackableDB optionally take shared ownership of the underlying DB
Summary:
Allow StackableDB optionally takes a shared_ptr on construction and thus hold shared ownership of the underlying DB.
Closes https://github.com/facebook/rocksdb/pull/3423

Differential Revision: D6824163

Pulled By: yiwu-arbug

fbshipit-source-id: dbdc30c42e007533a987ef413785e192340f03eb
2018-01-26 15:28:44 -08:00
Ben Darnell
65d431639b Update comments about default WALRecoveryMode
Summary:
The default changed in 6a14f7a976 but this comment was not updated.
Closes https://github.com/facebook/rocksdb/pull/3409

Differential Revision: D6808264

Pulled By: maysamyabandeh

fbshipit-source-id: 0d7e2a054eb181e9a144fcb783cf0b2c77219bc0
2018-01-25 18:12:08 -08:00
Yi Wu
35d8e65a04 Make Iterator::SeekForPrev pure virtual
Summary:
To prevent user who implement the Iterator interface fail to implement SeekForPrev by mistake.
Closes https://github.com/facebook/rocksdb/pull/3402

Differential Revision: D6790681

Pulled By: yiwu-arbug

fbshipit-source-id: bd75b8ced30208982e0a1414d34384d93496827a
2018-01-23 16:12:19 -08:00
Nathan VanBenschoten
ec0167eecb Fix WriteBatch rep_ format for RangeDeletion records
Summary:
This is a small amount of general cleanup I made while experimenting with https://github.com/facebook/rocksdb/issues/3391.
Closes https://github.com/facebook/rocksdb/pull/3392

Differential Revision: D6788365

Pulled By: yiwu-arbug

fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2
2018-01-23 12:57:32 -08:00
Islam AbdelRahman
c615689bb5 Support skipping bloom filters for SstFileWriter
Summary:
Add an option for SstFileWriter to skip building bloom filters
Closes https://github.com/facebook/rocksdb/pull/3360

Differential Revision: D6709120

Pulled By: IslamAbdelRahman

fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809
2018-01-22 14:42:18 -08:00
topilski
b9873162f0 Fixed get version on windows, moved throwing exceptions into cc file.
Summary:
Fixes for msys2 and mingw, hide exceptions into cpp  file.
Closes https://github.com/facebook/rocksdb/pull/3377

Differential Revision: D6746707

Pulled By: yiwu-arbug

fbshipit-source-id: 456b38df80bc48b8386a2cf87f669b5a4f9999a4
2018-01-18 14:56:56 -08:00
jonasf
4decff6fa8 Add possibility to change ttl on open DB
Summary:
We have seen cases where it could be good to change TTL on already open DB.
Change ttl in TtlCompactionFilterFactory on open db.
Next time a filter is created, it will filter accroding to the set TTL.

Is this something that could be useful for others?
Any downsides?
Closes https://github.com/facebook/rocksdb/pull/3292

Differential Revision: D6731993

Pulled By: miasantreble

fbshipit-source-id: 73b94d69237b11e8730734389052429d621a6b1e
2018-01-18 10:42:15 -08:00
Anand Ananthabhotla
d0f1b49ab6 Add a Close() method to DB to return status when closing a db
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348

Differential Revision: D6698158

Pulled By: anand1976

fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
2018-01-16 11:08:57 -08:00
Bo Liu
204af1eccc add WriteBatch::WriteBatch(std::string&&)
Summary:
to save a string copy for some use cases.

The change is pretty straightforward, please feel free to let me know if you want to suggest any tests for it.
Closes https://github.com/facebook/rocksdb/pull/3349

Differential Revision: D6706828

Pulled By: yiwu-arbug

fbshipit-source-id: 873ce4442937bdc030b395c7f99228eda7f59eb7
2018-01-11 15:43:56 -08:00
Anand Ananthabhotla
199405192d Add a BlockBasedTableOption to turn off index block compression.
Summary:
Add a new bool option index_uncompressed in BlockBasedTableOptions.
Closes https://github.com/facebook/rocksdb/pull/3303

Differential Revision: D6686161

Pulled By: anand1976

fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054
2018-01-10 15:11:59 -08:00
Chris Lu
24e2c1640d add support for allow_ingest_behind in C API
Summary:
https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files

Need to expose these functions in the C API to be used by Go bindings.
Closes https://github.com/facebook/rocksdb/pull/3011

Differential Revision: D6679563

Pulled By: sagar0

fbshipit-source-id: 536f844ddaeb0172c6d7e416d2a75e8f9e57c8ef
2018-01-08 17:26:31 -08:00
Yi Wu
2190e96727 Remove incorrect comment
Summary:
We actually create individual compaction filter from compaction filter factory per sub-compaction in `CompactionJob::ProcessKeyValueCompaction`: https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L742
The comment seems incorrect.
Closes https://github.com/facebook/rocksdb/pull/3288

Differential Revision: D6598455

Pulled By: yiwu-arbug

fbshipit-source-id: a6bc059a9103b87a73ae6ec4bb01ca33f5d48cf5
2017-12-18 17:56:47 -08:00
Anand Ananthabhotla
fccc12f386 Add a histogram stat for memtable flush
Summary:
Add a new histogram stat called rocksdb.db.flush.micros for memtable
flush
Closes https://github.com/facebook/rocksdb/pull/3269

Differential Revision: D6559496

Pulled By: anand1976

fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d
2017-12-15 18:57:00 -08:00
Andrew Kryczka
5a7e08468a fix ThreadStatus for bottom-pri compaction threads
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270

Differential Revision: D6559428

Pulled By: ajkr

fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
2017-12-14 14:57:49 -08:00
Siying Dong
0d5692e02b Switch version to 5.10
Summary: Closes https://github.com/facebook/rocksdb/pull/3252

Differential Revision: D6539373

Pulled By: siying

fbshipit-source-id: ce7c3d3fe625852179055295da9cf7bc80755025
2017-12-11 15:42:01 -08:00
Andrew Kryczka
78d1a5ec72 Preserve overlapping file endpoint invariant
Summary:
Fix for #2833.

- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843

Differential Revision: D5772387

Pulled By: ajkr

fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
2017-12-06 18:56:54 -08:00
kapitan-k
75d57a5d53 C API: Add some block based table options
Summary: Closes https://github.com/facebook/rocksdb/pull/3159

Differential Revision: D6428220

Pulled By: sagar0

fbshipit-source-id: 60508d09b5281f54b907a1c40e9631fc08343131
2017-11-28 14:12:44 -08:00
Yi Wu
78279350aa Blob DB: Add statistics
Summary:
Adding a list of blob db counters.

Also remove WaStats() which doesn't expose the stats and can be substitute by (BLOB_DB_BYTES_WRITTEN / BLOB_DB_BLOB_FILE_BYTES_WRITTEN).
Closes https://github.com/facebook/rocksdb/pull/3193

Differential Revision: D6394216

Pulled By: yiwu-arbug

fbshipit-source-id: 017508c8ff3fcd7ea7403c64d0f9834b24816803
2017-11-28 11:58:49 -08:00
Jeffrey Dallatezza
022c598abb Fix minor typo in comment
Summary:
mean -> meant
Closes https://github.com/facebook/rocksdb/pull/3202

Differential Revision: D6426443

Pulled By: sagar0

fbshipit-source-id: adaf07218580ee6903986fa5686de92f43f420e1
2017-11-28 11:27:02 -08:00
Phani Shekhar Mantripragada
4b65cfc723 Support for block_cache num_shards and other config via option string.
Summary:
Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M".
It doesn't accept other parameters like num_shards etc.

Changes :
1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>".
Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well.
For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings.

2) Corresponding unit tests are added.
Closes https://github.com/facebook/rocksdb/pull/3108

Differential Revision: D6420997

Pulled By: sagar0

fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43
2017-11-28 10:48:53 -08:00
Griffin Smith
2f09524762 Expose all remaining read and write options via the C API
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185

Differential Revision: D6389658

Pulled By: sagar0

fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
2017-11-28 10:28:46 -08:00
anand1976
d394a6bb48 Add a ticker stat for number of keys skipped during iteration
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.

Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177

Differential Revision: D6353897

Pulled By: anand1976

fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
2017-11-20 21:26:37 -08:00
Gustav Davidsson
2d04ed65e4 Make trash-to-DB size ratio limit configurable
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158

Differential Revision: D6304897

Pulled By: gdavidsson

fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
2017-11-17 11:58:17 -08:00
Maysam Yabandeh
53863b76f9 WritePrepared Txn: fix bug with Rollback seq
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157

Differential Revision: D6304291

Pulled By: maysamyabandeh

fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
2017-11-15 08:27:06 -08:00
Yi Wu
42564ada53 Blob DB: not using PinnableSlice move assignment
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164

Differential Revision: D6319201

Pulled By: yiwu-arbug

fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
2017-11-13 18:12:20 -08:00
Maysam Yabandeh
2515266725 WritePrepared Txn: Refactoring TrackKeys
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140

Differential Revision: D6290258

Pulled By: maysamyabandeh

fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
2017-11-11 13:14:20 -08:00
Maysam Yabandeh
2edc92bc28 WritePrepared Txn: cross-compatibility test
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118

Differential Revision: D6227247

Pulled By: maysamyabandeh

fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
2017-11-11 11:28:37 -08:00
Maysam Yabandeh
857adf388f WritePrepared Txn: Refactor conf params
Summary:
Summary of changes:
- Move seq_per_batch out of Options
- Rename concurrent_prepare to two_write_queues
- Add allocate_seq_only_for_data_
Closes https://github.com/facebook/rocksdb/pull/3136

Differential Revision: D6304458

Pulled By: maysamyabandeh

fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c
2017-11-10 17:28:12 -08:00
Shaohua Li
eefd75a228 Stream
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095

Differential Revision: D6298030

Pulled By: shligit

fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
2017-11-10 09:26:24 -08:00
kapitan-k
f1c5eaba56 updated c ingestexternalfileoptions for ingest behind
Summary: Closes https://github.com/facebook/rocksdb/pull/3151

Differential Revision: D6293861

Pulled By: ajkr

fbshipit-source-id: f8db0a71509d1cd8237f2d377bf9e1bb0464bdbf
2017-11-09 18:15:09 -08:00
Andrew Kryczka
93f69cb93a use bottommost compression when base level is bottommost
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:

- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level

I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141

Differential Revision: D6264614

Pulled By: ajkr

fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
2017-11-09 17:42:00 -08:00
Manuel Ung
e03377c7fd Add lock wait time as a perf context counter
Summary:
Adds two new counters:

`key_lock_wait_count` counts how many times a lock was blocked by another transaction and had to wait, instead of being granted the lock immediately.
`key_lock_wait_time` counts the time spent acquiring locks.
Closes https://github.com/facebook/rocksdb/pull/3107

Differential Revision: D6217332

Pulled By: lth

fbshipit-source-id: 55d4f46da5550c333e523263422fd61d6a46deb9
2017-11-06 10:57:19 -08:00
Yi Wu
be410dede8 Fix PinnableSlice move assignment
Summary:
After move assignment, we need to re-initialized the moved PinnableSlice.

Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move.
Closes https://github.com/facebook/rocksdb/pull/3127

Differential Revision: D6238585

Pulled By: yiwu-arbug

fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e
2017-11-03 18:13:21 -07:00
Andrew Kryczka
24ad430600 pass key/value samples through zstd compression dictionary generator
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.

Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057

Differential Revision: D6116891

Pulled By: ajkr

fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
2017-11-02 22:56:36 -07:00
Prashant D
602fe9454c Fix coverity issues in include/rocksdb
Summary:
include/rocksdb/metadata.h:
struct ColumnFamilyMetaData {

CID 1322804 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_count is not initialized in this constructor nor in any functions that it calls.

struct SstFileMetaData {
        2. uninit_member: Non-static class member size is not initialized in this constructor nor in any functions that it calls.
        4. uninit_member: Non-static class member smallest_seqno is not initialized in this constructor nor in any functions that it calls.
        6. uninit_member: Non-static class member largest_seqno is not initialized in this constructor nor in any functions that it calls.
        8. uninit_member: Non-static class member num_reads_sampled is not initialized in this constructor nor in any functions that it calls.

CID 1322807 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
10. uninit_member: Non-static class member being_compacted is not initialized in this constructor nor in any functions that it calls.

include/rocksdb/sst_file_writer.h:
struct ExternalSstFileInfo {
        2. uninit_member: Non-static class member sequence_number is not initialized in this constructor nor in any functions that it calls.
        4. uninit_member: Non-static class member file_size is not initialized in this constructor nor in any functions that it calls.
        6. uninit_member: Non-static class member num_entries is not initialized in this constructor nor in any functions that it calls.

CID 1351697 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
8. uninit_member: Non-static class member version is not initialized in this constructor nor in any functions that it calls.
 31  ExternalSstFileInfo() {}

include/rocksdb/utilities/transaction.h:
explicit Transaction(const TransactionDB* db) {}
        2. uninit_member: Non-static class member log_number_ is not initialized in this constructor nor in any functions that it calls.

CID 1396133 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member field txn_state_._M_i is not initialized in this constructor nor in any functions that it calls.
473  Transaction() {}
Closes https://github.com/facebook/rocksdb/pull/3100

Differential Revision: D6227651

Pulled By: sagar0

fbshipit-source-id: 5caa4a2cf9471d1f9c3c073f81473636e1f0aa14
2017-11-02 17:56:48 -07:00
Yi Wu
62578d80c1 Blob DB: Add compaction filter to remove expired blob index entries
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090

Differential Revision: D6183812

Pulled By: yiwu-arbug

fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
2017-11-02 17:27:38 -07:00
Maysam Yabandeh
25fbd9a996 Remove the experimental notes about partitioning
Summary:
This patch will remove the existing comments that declare partitioning indexes and filters as experimental.
Closes https://github.com/facebook/rocksdb/pull/3115

Differential Revision: D6222227

Pulled By: maysamyabandeh

fbshipit-source-id: 6179ec43b22c518494051b674d91c9e1b54d4ac0
2017-11-02 11:14:30 -07:00
Andrew Kryczka
cd124215df release 5.9
Summary:
updated HISTORY.md and version.h for the release.
Closes https://github.com/facebook/rocksdb/pull/3110

Differential Revision: D6218645

Pulled By: ajkr

fbshipit-source-id: 99ab8473e9088b02d7596e92351cce7a60a99e93
2017-11-01 21:26:14 -07:00
Mikhail Antonov
7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Maysam Yabandeh
17731a43a6 WritePrepared Txn: Optimize for recoverable state
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071

Differential Revision: D6148023

Pulled By: maysamyabandeh

fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
2017-11-01 17:26:46 -07:00
Shaohua Li
33c7d4ccd9 Make writable_file_max_buffer_size dynamic
Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681
2017-10-31 13:56:35 -07:00
Yi Wu
84a04af9a9 TableProperty::oldest_key_time defaults to 0
Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.

Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079

Differential Revision: D6165702

Pulled By: yiwu-arbug

fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
2017-10-27 15:00:05 -07:00
Islam AbdelRahman
05993155ef Mark files as trash by using .trash extension
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970

Differential Revision: D5976805

Pulled By: IslamAbdelRahman

fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
2017-10-27 13:27:12 -07:00
Andrew Kryczka
95667383db implement lower bound for iterators
Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074

Differential Revision: D6158654

Pulled By: ajkr

fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
2017-10-26 17:27:42 -07:00
Yi Wu
5a2a6483dc Blob DB: Inline small values in base DB
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.

Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.

See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066

Differential Revision: D6142115

Pulled By: yiwu-arbug

fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
2017-10-26 12:30:54 -07:00
Andrew Kryczka
9b18cc2363 single-file bottom-level compaction when snapshot released
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
2017-10-25 16:30:37 -07:00
Yi Wu
66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Sagar Vemuri
f0804db7f7 Make FIFO compaction options dynamically configurable
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.

Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`

Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006

Differential Revision: D6058619

Pulled By: sagar0

fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
2017-10-19 15:26:36 -07:00
Dmitri Smirnov
ebab2e2d42 Enable MSVC W4 with a few exceptions. Fix warnings and bugs
Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
2017-10-19 10:57:12 -07:00
Nikhil Benesch
7891af8b53 expose a hook to skip tables during iteration
Summary:
As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.

This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.

We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in  from hours to seconds.

FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a`std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.

```diff
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1546,6 +1546,13 @@ struct ReadOptions {
   // Default: false
   bool ignore_range_deletions;

+  // A callback to determine whether relevant keys for this scan exist in a
+  // given table based on the table's properties. The callback is passed the
+  // properties of each table during iteration. If the callback returns false,
+  // the table will not be scanned.
+  // Default: empty (every table will be scanned)
+  std::function<bool(const TableProperties&)> table_filter;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

/cc danhhz
Closes https://github.com/facebook/rocksdb/pull/2265

Differential Revision: D5054262

Pulled By: yiwu-arbug

fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476
2017-10-17 22:12:00 -07:00
codeeply
f7843f30a8 Move ~Comparator define to comparator.h
Summary:
When I impl my own comparator, and build in release mode.
The following compile error occurs.
undefined reference to `typeinfo for rocksdb::Comparator'

This fix allows users build with RTTI off when has their own comparator.
Closes https://github.com/facebook/rocksdb/pull/3008

Differential Revision: D6077354

Pulled By: yiwu-arbug

fbshipit-source-id: 914c26dbab72f0ad1f0e15f8666a3fb2f10bfed8
2017-10-17 09:58:13 -07:00
Yi Wu
31d3e41810 PinnableSlice move assignment
Summary:
Allow `std::move(pinnable_slice)`.
Closes https://github.com/facebook/rocksdb/pull/2997

Differential Revision: D6036782

Pulled By: yiwu-arbug

fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa
2017-10-12 18:28:24 -07:00
Sagar Vemuri
5a38e18627 Make some WriteOptions defaults more explicit
Summary:
Some WriteOptions defaults were not clearly documented. So, added comments to make the defaults more explicit.
Closes https://github.com/facebook/rocksdb/pull/2984

Differential Revision: D6014500

Pulled By: sagar0

fbshipit-source-id: a28078818e335e42b303c1fc6fbfec692ed16c7c
2017-10-09 16:36:52 -07:00
Adrien Schildknecht
01542400a8 Inform caller when rocksdb is stalling writes
Summary:
Add a new function in Listener to let the caller know when rocksdb
is stalling writes.
Closes https://github.com/facebook/rocksdb/pull/2897

Differential Revision: D5860124

Pulled By: schischi

fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1
2017-10-05 18:11:43 -07:00
Maysam Yabandeh
4e3c3d8c6a WritePrepared Txn: duplicate keys
Summary:
With WriteCommitted, when the write batch has duplicate keys, the txn db simply inserts them to the db with different seq numbers and let the db ignore/merge the duplicate values at the read time. With WritePrepared all the entries of the batch are inserted with the same seq number which prevents us from benefiting from this simple solution.

This patch applies a hackish solution to unblock the end-to-end testing. The hack is to be replaced with a proper solution soon. The patch simply detects the duplicate key insertions, and mark the previous one as obsolete. Then before writing to the db it rewrites the batch eliminating the obsolete keys. This would incur a memcpy cost. Furthermore handing duplicate merge would require to do FullMerge instead of simply ignoring the previous value, which is not handled by this patch.
Closes https://github.com/facebook/rocksdb/pull/2969

Differential Revision: D5976337

Pulled By: maysamyabandeh

fbshipit-source-id: 114e65b66f137d8454ff2d1d782b8c05da95f989
2017-10-05 07:41:02 -07:00
Andrew Kryczka
1026e794a3 rate limit auto-tuning
Summary:
Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:

- make rate limiter's `Env*` configurable for testing
- track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/2899

Differential Revision: D5858704

Pulled By: ajkr

fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
2017-10-04 19:15:01 -07:00
Yi Wu
d1cab2b64e Add ValueType::kTypeBlobIndex
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).

The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().

Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886

Differential Revision: D5838431

Pulled By: yiwu-arbug

fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
2017-10-03 09:11:23 -07:00