Commit Graph

1494 Commits

Author SHA1 Message Date
Xiaofei Du
a0102aa6d7 Make database files' permissions configurable
Summary: Closes https://github.com/facebook/rocksdb/pull/3709

Differential Revision: D7610227

Pulled By: xiaofeidu008

fbshipit-source-id: 88a52f0f9f96e2195fccde995cf9760b785e9f07
2018-04-13 13:13:04 -07:00
zhangjinpeng1987
31ee4bf240 add kEntryRangeDeletion
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695

Differential Revision: D7570322

Pulled By: ajkr

fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
2018-04-13 11:27:17 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Maysam Yabandeh
d15397ba10 WritePrepared Txn: rollback_merge_operands hack
Summary:
This is a hack as temporary fix of MyRocks with rollbacking  the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass.
Closes https://github.com/facebook/rocksdb/pull/3711

Differential Revision: D7597177

Pulled By: maysamyabandeh

fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728
2018-04-12 11:58:11 -07:00
Yanqin Jin
d42bd041c5 Improve visibility into the reasons for compaction.
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679

Differential Revision: D7550348

Pulled By: riversand963

fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
2018-04-11 10:58:44 -07:00
Yanqin Jin
d95014b9df fix some text in comments.
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
   our best to conform to conventions. Specifically, use 'callback' instead of
   'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693

Differential Revision: D7560396

Pulled By: riversand963

fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
2018-04-10 15:59:24 -07:00
Maysam Yabandeh
bde1c1a72a WritePrepared Txn: add stats
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
2018-04-07 21:56:42 -07:00
Maysam Yabandeh
eb5a295440 WritePrepared Txn: add write_committed option to dump_wal
Summary:
Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work.

Tested by running it on a WAL generated by WRITE_PREPARED:
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log  | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30)
Closes https://github.com/facebook/rocksdb/pull/3682

Differential Revision: D7522090

Pulled By: maysamyabandeh

fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca
2018-04-07 21:56:42 -07:00
Andrew Kryczka
faba3fb53d protect valid backup files when max_valid_backups_to_open is set
Summary:
When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`.
Closes https://github.com/facebook/rocksdb/pull/3518

Differential Revision: D7008331

Pulled By: ajkr

fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305
2018-04-05 21:13:21 -07:00
Phani Shekhar Mantripragada
446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Maysam Yabandeh
67182678a5 Stats for false positive rate of full filtesr
Summary:
Adds two stats to allow us measuring the false positive rate of full filters:
- The total count of positives: rocksdb.bloom.filter.full.positive
- The total count of true positives: rocksdb.bloom.filter.full.true.positive
Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches.

Closes #3680

Tested by:
$ ./db_bench -benchmarks=fillrandom  -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
$ ./db_bench -benchmarks="readwhilewriting"  -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
$ grep filter.full /tmp/full.log
rocksdb.bloom.filter.full.positive COUNT : 3628593
rocksdb.bloom.filter.full.true.positive COUNT : 3536026
which gives the false positive rate of 2.5%
Closes https://github.com/facebook/rocksdb/pull/3681

Differential Revision: D7517570

Pulled By: maysamyabandeh

fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
2018-04-05 15:58:48 -07:00
Dmitri Smirnov
2a62ca1750 Make Optimistic Tx database stackable
Summary:
This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr.

Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db.
Closes https://github.com/facebook/rocksdb/pull/3566

Differential Revision: D7184502

Pulled By: yiwu-arbug

fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3
2018-04-03 15:28:40 -07:00
Sagar Vemuri
04c11b867d Level Compaction with TTL
Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.

This functionality can be controlled by the newly introduced column family option -- ttl.

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
2018-04-02 22:14:28 -07:00
Adam Retter
8917eee962 Fixed small typos
Summary: Closes https://github.com/facebook/rocksdb/pull/3667

Differential Revision: D7470060

Pulled By: miasantreble

fbshipit-source-id: 8e8545cda38f0805f35ccdb8841666a2d7a965f5
2018-04-01 17:14:46 -07:00
Sagar Vemuri
90c542347a Update comments about MergeOperator::AllowSingleOperand
Summary:
Updated comments around AllowSingleOperand.
Reason: A couple of users were confused and encountered issues due to no overriding PartialMerge with AllowSingleOperand=true.

I'll also look into modifying the default merge operator implementation so that overriding PartialMerge is not mandatory when AllowSingleOp=true.
Closes https://github.com/facebook/rocksdb/pull/3659

Differential Revision: D7422691

Pulled By: sagar0

fbshipit-source-id: 3d075a6ced0120f5d65cb7ae5412936f1862f342
2018-03-27 17:13:53 -07:00
Anand Ananthabhotla
f9f4d40f93 Align SST file data blocks to avoid spanning multiple pages
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502

Differential Revision: D7400897

Pulled By: anand1976

fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
2018-03-26 20:26:10 -07:00
Radoslaw Zarzynski
09b6bf828a InlineSkiplist: don't decode keys unnecessarily during comparisons
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.

Rough performance comparison
=====
Big keys, no compression.

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.222 micros/op 236836 ops/sec;   80.4 MB/s
```

```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom   :       4.064 micros/op 246059 ops/sec;   83.5 MB/s
```

TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516

Differential Revision: D7059300

Pulled By: ajkr

fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
2018-03-23 12:14:30 -07:00
Zhongyi Xie
1cbc96d236 FlushReason improvement
Summary:
Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full"

> $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
$ grep 'flush_reason' /dev/shm/dbbench/LOG
...
2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}

With the fix:
> 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
Closes https://github.com/facebook/rocksdb/pull/3627

Differential Revision: D7328772

Pulled By: miasantreble

fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
2018-03-22 18:42:18 -07:00
Siying Dong
118058ba69 SstFileManager: add bytes_max_delete_chunk
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640

Differential Revision: D7358679

Pulled By: siying

fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
2018-03-22 15:58:37 -07:00
Fosco Marotto
de6cf95a53 Update history for future 5.13 release
Summary: Closes https://github.com/facebook/rocksdb/pull/3631

Differential Revision: D7367519

Pulled By: gfosco

fbshipit-source-id: 57826cc1c9ffc9f2b351075567b8ad929809cb74
2018-03-22 14:59:27 -07:00
QingpingWang
2ce8f63f81 C API for PerfContext
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607

Differential Revision: D7294225

Pulled By: ajkr

fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
2018-03-21 22:13:34 -07:00
Niv Dayan
da82aab126 allowing CompactFiles to return new file names
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608

Differential Revision: D7275789

Pulled By: siying

fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
2018-03-15 11:58:12 -07:00
Sagar Vemuri
cc118b0e6f Update version
Summary:
We missed updating version.h on master when cutting 5.11.fb and 5.12.fb branches. It should be the same as the version in the latest release branch (or should it be one more?).

I noticed this when trying to run some upgrade/downgrade tests from 5.11 to some new code on master.
Closes https://github.com/facebook/rocksdb/pull/3611

Differential Revision: D7282917

Pulled By: sagar0

fbshipit-source-id: 205ee75b77c5b6bbcea95a272760b427025a4aba
2018-03-15 10:41:48 -07:00
Amy Tai
e476d0e252 Adding stat to count cancelled compactions
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574

Differential Revision: D7190259

Pulled By: amytai

fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
2018-03-08 10:42:28 -08:00
Bruce Mitchener
a3a3f5497c Fix some typos in comments and docs.
Summary: Closes https://github.com/facebook/rocksdb/pull/3568

Differential Revision: D7170953

Pulled By: siying

fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c
2018-03-08 10:27:25 -08:00
Sagar Vemuri
e69f6e8629 Fix API name in a comment in db.h
Summary:
... so that people are not confused.
Closes https://github.com/facebook/rocksdb/pull/3580

Differential Revision: D7187175

Pulled By: sagar0

fbshipit-source-id: bce70093d52e38cd24c9432fd708885d7c2c013e
2018-03-07 15:27:17 -08:00
Bruce Mitchener
0de710f5b8 Use nullptr instead of NULL / 0 more consistently.
Summary: Closes https://github.com/facebook/rocksdb/pull/3569

Differential Revision: D7170968

Pulled By: yiwu-arbug

fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348
2018-03-07 12:42:12 -08:00
Stuart
f021f1d9e1 Add rocksdb_open_with_ttl function in C API
Summary:
Change-Id: Ie6f9b10bce459f6bf0ade0e5877264b4e10da3f5
Signed-off-by: Stuart <Stuart.Hu@emc.com>
Closes https://github.com/facebook/rocksdb/pull/3553

Differential Revision: D7144833

Pulled By: sagar0

fbshipit-source-id: 815225fa6e560d8a5bc47ffd0a98118b107ce264
2018-03-06 20:57:20 -08:00
amytai
0a3db28d98 Disallow compactions if there isn't enough free space
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.

Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449

Differential Revision: D7016941

Pulled By: amytai

fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
2018-03-06 16:27:54 -08:00
Dmitri Smirnov
c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and aling unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoin tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Yi Wu
1209b6db5c Blob DB: remove existing garbage collection implementation
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.

CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551

Differential Revision: D7130190

Pulled By: yiwu-arbug

fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
2018-03-02 12:57:23 -08:00
Yi Wu
bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
Andrew Kryczka
3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Siying Dong
c287c098a4 Update comments in DB::Close()
Summary: Closes https://github.com/facebook/rocksdb/pull/3543

Differential Revision: D7093251

Pulled By: siying

fbshipit-source-id: 4066b82c95ecb65866c5842d68ab13ab9f85d567
2018-02-27 12:42:31 -08:00
Anand Ananthabhotla
dfbe52e099 Fix the Logger::Close() and DBImpl::Close() design pattern
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528

Reviewed By: gfosco

Differential Revision: D7049303

Pulled By: anand1976

fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
2018-02-23 13:57:26 -08:00
Maysam Yabandeh
90eca1e616 WritePrepared Txn: optimize SubBatchCnt
Summary:
Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected.
Closes https://github.com/facebook/rocksdb/pull/3529

Differential Revision: D7049947

Pulled By: maysamyabandeh

fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd
2018-02-22 18:12:26 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Sagar Vemuri
8ada876dfe Add rocksdb.iterator.internal-key property
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
2018-02-20 19:12:09 -08:00
Maysam Yabandeh
8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
Maysam Yabandeh
8a04ee4fd1 WritePrepared Txn: use TransactionDBWriteOptimizations (2nd attempt)
Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496

Differential Revision: D6971784

Pulled By: maysamyabandeh

fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
2018-02-12 16:43:40 -08:00
Andrew Kryczka
ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
jsteemann
d62af7fa4f fix a typo (of a potential vi user)
Summary: Closes https://github.com/facebook/rocksdb/pull/3481

Differential Revision: D6939089

Pulled By: siying

fbshipit-source-id: ccce3ae10cc5ff50a74b85804afd044b21a3c3e2
2018-02-09 12:58:07 -08:00
Zhongyi Xie
945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it,  logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Maysam Yabandeh
8feee28020 Add skip_cc option to TransactionDB::Write
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457

Differential Revision: D6877318

Pulled By: maysamyabandeh

fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
2018-02-06 15:28:24 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Anand Ananthabhotla
4b124fb9d3 Handle error return from WriteBuffer()
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404

Differential Revision: D6879954

Pulled By: anand1976

fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
2018-02-05 13:59:34 -08:00