Commit Graph

1796 Commits

Author SHA1 Message Date
Andrew Kryczka
8272a6de57 Optionally wait on bytes_per_sync to smooth I/O (#5183)
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183

Differential Revision: D14953553

Pulled By: siying

fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
2019-04-22 11:51:39 -07:00
Mike Kolupaev
df38c1ce66 Add BlockBasedTableOptions::index_shortening (#5174)
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174

Differential Revision: D14884185

Pulled By: al13n321

fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
2019-04-22 08:20:35 -07:00
jsteemann
de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints will very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
anand76
5265c5709e Remove a couple of non-public includes from public header file (#5219)
Summary:
Cleanup a couple of stray includes left by #5011.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5219

Differential Revision: D15007244

Pulled By: anand1976

fbshipit-source-id: 15ca1d4f977b5b60e99df3bfb8fc3db217d19bdd
2019-04-19 11:10:33 -07:00
Sagar Vemuri
efa948741c Use creation_time or mtime when file_creation_time=0 (#5184)
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).

To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.

Here how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184

Differential Revision: D14907795

Pulled By: sagar0

fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
2019-04-18 22:39:34 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Zhongyi Xie
baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Fosco Marotto
b5cad5c986 Update history and version to 6.1.1 (#5171)
Summary:
Including latest fixes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5171

Differential Revision: D14875157

Pulled By: gfosco

fbshipit-source-id: 86ec7ee3553a9b25ab71ed98966ce08a16322e2c
2019-04-15 10:49:38 -07:00
Manuel Ung
d655a3aab7 Remove extraneous call to TrackKey (#5173)
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.

For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.

This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173

Differential Revision: D14883809

Pulled By: lth

fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
2019-04-12 16:37:12 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong
ed9f5e21aa Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165)
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165

Differential Revision: D14880709

Pulled By: siying

fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
2019-04-11 10:45:36 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Sergei Glushchenko
39c6c5fc1b Expose DB methods to lock and unlock the WAL (#5146)
Summary:
Expose DB methods to lock and unlock the WAL.

These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.

Usage scenario is following:

MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:

1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
   InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
   LSN and checkpoint LSN. MyRocks should report active WAL files names
   and sizes.
5. Release storage engine's locks
6. Unlock binary log

Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.

Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.

LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146

Differential Revision: D14815447

Pulled By: maysamyabandeh

fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
2019-04-06 06:40:36 -07:00
Mike Kolupaev
306b9adfd8 Add missing methods to EnvWrapper, and more wrappers in Env.h (#5131)
Summary:
- Some newer methods of Env weren't wrapped in EnvWrapper. Fixed.
 - Added more wrapper classes similar to WritableFileWrapper: SequentialFileWrapper, RandomAccessFileWrapper, RandomRWFileWrapper, DirectoryWrapper, LoggerWrapper.
 - Moved the code around a bit, removed some unused friendships, added some comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5131

Differential Revision: D14738932

Pulled By: al13n321

fbshipit-source-id: 99a9b1af28f2c629e7b7501389fa920b5ce30218
2019-04-04 14:47:41 -07:00
Adam Simpkins
c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
Zhongyi Xie
26015f3b48 add compression options to table properties (#5081)
Summary:
Since we are planning to use dictionary compression and to use different compression level, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query from information_schema.rocksdb_sst_props to see if all sst files are converted to ZSTD dictionary compressions. Resolves https://github.com/facebook/rocksdb/issues/4992

With this PR, user can query table properties through `GetPropertiesOfAllTables` API and get compression options as std::string:
`window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;`
or table_properties->ToString() will also contain it
`# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081

Differential Revision: D14716692

Pulled By: miasantreble

fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd
2019-04-02 14:52:34 -07:00
Maysam Yabandeh
14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows +0.26% higher throughput in seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Mike Kolupaev
120bc4715b Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043)
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.

In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.

(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043

Differential Revision: D14339233

Pulled By: al13n321

fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
2019-04-01 17:10:40 -07:00
anand76
dae3b5545c Smooth the deletion of WAL files (#5116)
Summary:
WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116

Differential Revision: D14669437

Pulled By: anand1976

fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
2019-03-28 15:17:13 -07:00
Siying Dong
a98317f555 Option string/map can set merge operator from object registry (#5123)
Summary:
Allow customized merge operator to be loaded from option file/map/string
by allowing users to pre-regiester merge operators to object registry.

Also update HISTORY.md and header files for the same feature for comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5123

Differential Revision: D14658488

Pulled By: siying

fbshipit-source-id: 86ea2fbd2a0a04632d8ea9fceaffefd041f6ae61
2019-03-28 14:54:29 -07:00
Siying Dong
1f7f5a5a79 Run automatic formatter against public header files (#5115)
Summary:
Automatically format public headers so it looks more consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5115

Differential Revision: D14632854

Pulled By: siying

fbshipit-source-id: ce9929ea62f9dcd65c69660b23eed1931cb0ae84
2019-03-27 13:24:25 -07:00
Fosco Marotto
8c072044d2 Update history and version for 6.1
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5119

Differential Revision: D14645216

Pulled By: gfosco

fbshipit-source-id: f7c83dca22c2486fc5d8697b61638c382889d073
2019-03-27 11:21:34 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Shi Feng
01e6badbb6 Introduce CPU timers for iterator seek and next (#5076)
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076

Differential Revision: D14525218

Pulled By: fredfsh

fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
2019-03-26 16:32:13 -07:00
Zhongyi Xie
52e6404e0f ldb command parsing: allow option values to contain equals signs (#5088)
Summary:
Right now ldb command doesn't allow cases where option values contain equals sign. For example,
```
ldb --db=/tmp/test scan --from='q=3' --max_keys=1
```
after parsing, ldb will have one option 'db', 'max_keys' and one flag 'from'.
This PR updates the parsing logic so that it now supports the above mentioned cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5088

Differential Revision: D14600869

Pulled By: miasantreble

fbshipit-source-id: c6ef518c74a98d7b6675ea5954ae08b1bda5554e
2019-03-25 13:23:11 -07:00
Rashmi Sharma
a4396f9218 Make it easier for users to load options from option file and set shared block cache. (#5063)
Summary:
[RocksDB] Make it easier for users to load options from option file and set shared block cache.
Right now, it requires several dynamic casting for users to set the shared block cache to their option struct cast from the option file.
If people don't do that, every CF of every DB will generate its own 8MB block cache. It's not a usable setting. So we are dragging every user who loads options from the file into such a mess.
Instead, we should allow them to pass their cache object to LoadLatestOptions() and LoadOptionsFromFile(), so that those loaded option structs will have the shared block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5063

Differential Revision: D14518584

Pulled By: rashmishrm

fbshipit-source-id: c91430ff9425a0e67d76fc67931d755f491ca5aa
2019-03-21 16:25:28 -07:00
Levi Tamasi
34f8ac0c99 Make adaptivity of LRU cache mutexes configurable (#5054)
Summary:
The patch adds a new config option to LRUCacheOptions that enables
users to choose whether to use an adaptive mutex for the LRU block
cache (on platforms where adaptive mutexes are supported). The default
is true if RocksDB is compiled with -DROCKSDB_DEFAULT_TO_ADAPTIVE_MUTEX,
false otherwise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5054

Differential Revision: D14542749

Pulled By: ltamasi

fbshipit-source-id: 0065715ab6cf91f10444b737fed8c8aee6a8a0d2
2019-03-20 12:33:44 -07:00
Zhongyi Xie
a291f3a1e5 Collect compaction stats by priority and dump to info LOG (#5050)
Summary:
In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump.

```
** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050

Differential Revision: D14408583

Pulled By: miasantreble

fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
2019-03-19 17:28:19 -07:00
Andrew Audibert
e50326f327 Document the interaction between disableWAL and BackupEngine (#5071)
Summary:
BackupEngine relies on write-ahead logs to back up the memtable. Disabling write-ahead logs
can result in backups failing to preserve unflushed keys. This PR updates the documentation to specify this behavior, and suggest always flushing the memtable when write-ahead logs are disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5071

Differential Revision: D14524124

Pulled By: miasantreble

fbshipit-source-id: 635f855f8a42ad60273b5efd226139b511e3e5d5
2019-03-19 14:58:14 -07:00
Wenjie Yang
36c2a7cfb1 Add an option to filter traces (#5082)
Summary:
Add an option to filter out READ or WRITE operations while tracing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082

Differential Revision: D14515083

Pulled By: mrmiywj

fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610
2019-03-19 14:36:51 -07:00
Hiroaki Nakamura
f2f6acbef3 Add missing C API for transaction (#5077)
Summary:
Partly addresses https://github.com/facebook/rocksdb/issues/4999
I verified `make static_lib` runs fine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5077

Differential Revision: D14521101

Pulled By: maysamyabandeh

fbshipit-source-id: ba88e74a51d2d793cac7260d505b1a54254b53af
2019-03-19 09:43:22 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
anand76
b4fa51dfaf Update bg_error when log flush fails in SwitchMemtable() (#5072)
Summary:
There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order set the background error since the WriteBuffer() failure may result in data loss.

Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072

Differential Revision: D14461384

Pulled By: anand1976

fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
2019-03-15 15:19:25 -07:00
Siying Dong
0920bf4e68 Revert "Remove PlainTable's feature store_index_in_file (#4914)" (#5034)
Summary:
This reverts commit ee1818081f.

We are not ready to deprecate this feature. revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5034

Differential Revision: D14287246

Pulled By: siying

fbshipit-source-id: e4beafdeaee1c94364fdaa6ba198218d158339f7
2019-03-01 15:45:45 -08:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Adam Retter
bb474e9a02 Add missing functionality to RocksJava (#4833)
Summary:
This is my latest round of changes to add missing items to RocksJava. More to come in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4833

Differential Revision: D14152266

Pulled By: sagar0

fbshipit-source-id: d6cff67e26da06c131491b5cf6911a8cd0db0775
2019-02-22 14:46:46 -08:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Fosco Marotto
48c8d8445e Update version and history for 6.0 2019-02-20 10:10:11 -08:00
Maysam Yabandeh
0f4244fe00 WritePrepared: Improve stress tests with slow threads (#4974)
Summary:
The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974

Differential Revision: D14143070

Pulled By: maysamyabandeh

fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7
2019-02-19 16:56:49 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Yanqin Jin
4fc442029a Avoid using kInAtomicGroup tag for single-cf op (#4981)
Summary:
if an operation just involves a single column family, then we do
not have to set the kInAtomicGroup tag when writing to MANIFEST. This change
can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize
kInAtomicGroup tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981

Differential Revision: D14072687

Pulled By: riversand963

fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4
2019-02-13 18:33:42 -08:00
Yanqin Jin
5af9446ee6 Remove Lua compaction filter from RocksDB main repo (#4971)
Summary:
as title. For people who continue to need Lua compaction filter, you
can copy the include/rocksdb/utilities/rocks_lua/lua_compaction_filter.h and
utilities/lua/rocks_lua_compaction_filter.cc to your own codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4971

Differential Revision: D14047468

Pulled By: riversand963

fbshipit-source-id: 9ad1a6484a7c94e478f1e108127a3184e4069f70
2019-02-13 12:42:44 -08:00
Yanqin Jin
a69d4deefb Atomic ingest (#4895)
Summary:
Make file ingestion atomic.

 as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or non of the ingestions have any effect
on the state of the db.

Also add unit tests for atomic ingestion.

Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895

Differential Revision: D13718245

Pulled By: riversand963

fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
2019-02-12 19:16:17 -08:00
Maysam Yabandeh
9144d1f186 WritePrepared: add private options to TransactionDBOptions (#4966)
Summary:
WritePreparedTransactionDB operates with more options which should not be configurable to avoid complicating it for the users. For testing purposes however we need to change the default value of this parameters. This patch makes these parameters private fields in TransactionDBOptions so that the existing ::Open API could use them seamlessly without however exposing them to the users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4966

Differential Revision: D14015986

Pulled By: maysamyabandeh

fbshipit-source-id: 13037efa7dfdd6f73ec7a19414b66571e044c633
2019-02-11 14:44:02 -08:00
tang-jianfeng
08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Maysam Yabandeh
39fb88f14e Reset size_ to 0 in PinnableSlice::Reset (#4962)
Summary:
It would avoid bugs if the reused PinnableSlice is not actually reassigned and yet the programmer makes conclusions based on the size of the Slice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4962

Differential Revision: D14012710

Pulled By: maysamyabandeh

fbshipit-source-id: 23f4e173386b5461fd5650f44cde470805f4e816
2019-02-08 16:51:17 -08:00
Siying Dong
f48758e939 Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954)
Summary:
We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
what we have expected. We thought that snapshot will always be preserved.
However, we just realized that, if no snapshot is created while compaction
starts, and a snapshot is created after that, the data seen from the snapshot
can successfully be dropped by the compaction. This creates a strange behavior
to the feature, which is hard to explain. Like what is documented in code
comment, this feature is not very useful with snapshot anyway. The decision
is to deprecate the feature.

We keep the function to avoid to break users code. However, we will fail
compactions if false is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954

Differential Revision: D13981900

Pulled By: siying

fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
2019-02-07 16:57:33 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Zhongyi Xie
00ed41daee Allow copy for PerfContext objects (#4919)
Summary:
Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919

Differential Revision: D13960406

Pulled By: miasantreble

fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334
2019-02-05 14:29:08 -08:00
Andrew Kryczka
0ea57115a3 Fix WriteBatchBase::DeleteRange API comment (#4935)
Summary:
The `DeleteRange` end key is exclusive, not inclusive. Updated API comment accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4935

Differential Revision: D13905406

Pulled By: ajkr

fbshipit-source-id: f577db841a279427991ecf9005cd56b30c8eb3c7
2019-01-31 14:43:40 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
158da7a6ee Verify checksum before ingestion (#4916)
Summary:
before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916

Differential Revision: D13863501

Pulled By: riversand963

fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
2019-01-29 17:17:29 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Dmitry Fink
e07aa8669d Allow full merge when root of history for a key is reached (#4909)
Summary:
Previously compaction was not collapsing operands for a first
key on a layer, even in cases when it was its root of history. Some
tests (CompactionJobTest.NonAssocMerge) was actually accounting
for that bug,
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909

Differential Revision: D13781169

Pulled By: finik

fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336
2019-01-23 21:46:10 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Siying Dong
d94aa2f7db Make compaction_pri = kMinOverlappingRatio to be default (#4911)
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911

Differential Revision: D13789262

Pulled By: siying

fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
2019-01-23 16:47:38 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Siying Dong
1fb2e274c5 Remove some components (#4101)
Summary:
Remove some components that we never heard people using them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4101

Differential Revision: D8825431

Pulled By: siying

fbshipit-source-id: 97a12ad3cad4ab12c82741a5ba49669aaa854180
2019-01-10 13:30:09 -08:00
Yanqin Jin
75714b4c08 Initialize two members in PerfContext (#4859)
Summary:
as titled.
Currently it's possible to create a local object of type PerfContext since it's
part of public API. Then it's safe to initialize the two members to 0.
If PerfContext is created as thread-local object, then all members are
zero-initialized according to C++ standard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4859

Differential Revision: D13614504

Pulled By: riversand963

fbshipit-source-id: 406ff548e105a074f379ad1054d56fece5f524a0
2019-01-09 15:55:03 -08:00
Huachao Huang
74f7d7551e tools: use provided options instead of the default (#4839)
Summary:
The current implementation hardcode the default options in different
places, which makes it impossible to support other environments (like
encrypted environment).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4839

Differential Revision: D13573578

Pulled By: sagar0

fbshipit-source-id: 76b58b4b758902798d10ff2f52d9f39abff015e7
2019-01-03 11:23:49 -08:00
Max
b1288cdc24 Fix typos in comments (#4819)
Summary:
Fix some typos in comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4819

Differential Revision: D13548543

Pulled By: siying

fbshipit-source-id: ca2e128fa47bef32892fc3627a7541fd9e2d5c3f
2018-12-26 09:43:56 -08:00
Alexander Zinoviev
80bf8975fd Add a new per level counter for block cache hit (#4796)
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796

Differential Revision: D13513688

Pulled By: zinoale

fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
2018-12-21 13:20:05 -08:00
Siying Dong
da1c64b6e7 Introduce a CPU time counter in perf_context (#4741)
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741

Differential Revision: D13287798

Pulled By: siying

fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
2018-12-20 12:03:44 -08:00
Adam Retter
1b0c9ce396 Fix Windows broken build error due to non-const override (#4798)
Summary:
1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`.
The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows.

2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798

Differential Revision: D13519734

Pulled By: sagar0

fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459
2018-12-19 13:29:51 -08:00
Roman Zeyde
a62c6626e0 Support setting options on column families via C bindings (#4785)
Summary:
Currently, it supports setting options only on the default column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4785

Differential Revision: D13491819

Pulled By: ajkr

fbshipit-source-id: 75c78bd86222bb05568e538562af84fb53eb4d8d
2018-12-17 13:52:12 -08:00
Adam Singer
a914a1c6dc Add getMin, getMax, getCount, getSum to HistogramData class object. (#4742)
Summary:
Expose common stats min,max,count,sum via statistics JNI. These stats are not fully exposed on the Java side as is, but are available on the native side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4742

Differential Revision: D13403766

Pulled By: ajkr

fbshipit-source-id: 5b70f7bd3fb7490aab73dcbd09f13490fce5c773
2018-12-14 14:28:44 -08:00
Adam Singer
d6dfe516ff Synchronize ticker and histogram metrics for Java API (#4733)
Summary:
Updating the `HistogramType.java` and `TickerType.java` to expose and correct metrics for statistics callbacks.

Moved `NO_ITERATOR_CREATED` to the proper stat name and deprecated `NO_ITERATORS`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4733

Differential Revision: D13466936

Pulled By: sagar0

fbshipit-source-id: a58d1edcc07c7b68c3525b1aa05828212c89c6c7
2018-12-14 11:37:05 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Burton Li
a8b9891f95 Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918

We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.

With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.

ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.

The usage is straight forward:
e.g.:

//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);

// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;

// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;

...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332

Differential Revision: D13226590

Pulled By: siying

fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 13:18:28 -08:00
DorianZheng
4862720e08 Expose column family id to FlushJobInfo
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772

Differential Revision: D13428923

Pulled By: ajkr

fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a
2018-12-11 20:33:42 -08:00
Maysam Yabandeh
21fca397cc Fix inline comments for assumed_tracked (#4762)
Summary:
Fix the definition of assumed_tracked in Transaction that was introduced in #4680
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4762

Differential Revision: D13399150

Pulled By: maysamyabandeh

fbshipit-source-id: 2a30fe49e3c44adacd7e45cd48eae95023ca9dca
2018-12-10 09:56:21 -08:00
Anand Ananthabhotla
1b01d23be2 Add PerfContext counters for index/filter block cache stats (#4540)
Summary:
Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540

Differential Revision: D10459652

Pulled By: anand1976

fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425
2018-12-07 15:07:56 -08:00
Maysam Yabandeh
b878f93c70 Extend Transaction::GetForUpdate with do_validate (#4680)
Summary:
Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. It true it indicates that call is assumed to be after a ::GetForUpdate(do_validate=false).
The Java APIs are accordingly updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680

Differential Revision: D13068508

Pulled By: maysamyabandeh

fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
2018-12-06 17:49:00 -08:00
Zhongyi Xie
2f1ca4e838 Revert "BaseDeltaIterator: always check valid() before accessing key(… (#4744)
Summary:
…) (#4702)"

This reverts commit 3a18bb3e15.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4744

Differential Revision: D13311869

Pulled By: miasantreble

fbshipit-source-id: 6300b12cc34828d8b9274e907a3aef1506d5d553
2018-12-03 23:38:27 -08:00
Zhongyi Xie
3a18bb3e15 BaseDeltaIterator: always check valid() before accessing key() (#4702)
Summary:
Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors.
This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator
Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702

Differential Revision: D13146643

Pulled By: miasantreble

fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0
2018-11-30 15:35:13 -08:00
Siying Dong
6e938c904f Make NewBloomFilterPolicy() use full filter by default (#4735)
Summary:
Full block (use_block_based_builder=false) Bloom filter has clear CPU saving benefits but with limitation of using temp memory when building an SST file proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction. Now we change to a default with better performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735

Differential Revision: D13266674

Pulled By: siying

fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602
2018-11-30 13:13:27 -08:00
Yi Wu
cf1df5d3cb JemallocNodumpAllocator: option to limit tcache memory usage (#4736)
Summary:
Add option to limit tcache usage by allocation size. This is to reduce total tcache size in case there are many user threads accessing the allocator and incur non-trivial memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4736

Differential Revision: D13269305

Pulled By: yiwu-arbug

fbshipit-source-id: 95a9b7fc67facd66837c849137e30e137112e19d
2018-11-29 17:33:40 -08:00
Zhichao Cao
7125e24619 Add the max trace file size limitation option to Tracing (#4610)
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610

Differential Revision: D12893400

Pulled By: zhichao-cao

fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
2018-11-27 14:27:05 -08:00
Huachao Huang
5e72bc113a Add SstFileReader to read sst files (#4717)
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.

Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.

TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717

Differential Revision: D13212686

Pulled By: ajkr

fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
2018-11-27 13:02:23 -08:00
Abhishek Madan
e76448185c Remove DeleteRange experimental comment (#4709)
Summary:
DeleteRange is now ready for production use. Change the header comment to reflect this, and update HISTORY.md with the feature's status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4709

Differential Revision: D13209055

Pulled By: abhimadan

fbshipit-source-id: 65423eb1a4927cf593c38254cd87c322f73ae137
2018-11-27 11:11:35 -08:00
Adam Singer
1db4a096d4 Test mapping of Histograms and HistogramsNameMap (#4720)
Summary:
Adding sanity check test for mapping of `Histograms` and `HistogramsNameMap`

```
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from StatisticsTest
[ RUN      ] StatisticsTest.SanityTickers
[       OK ] StatisticsTest.SanityTickers (0 ms)
[ RUN      ] StatisticsTest.SanityHistograms
[       OK ] StatisticsTest.SanityHistograms (0 ms)
[----------] 2 tests from StatisticsTest (0 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (0 ms total)
[  PASSED  ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4720

Differential Revision: D13217061

Pulled By: ajkr

fbshipit-source-id: 6427f4e684c36b2f3c3440808b74fee86a364683
2018-11-27 10:48:30 -08:00
Soli
f1837595a3 FIX #3278: Move global const object definitions from .h to .cc (#4691)
Summary:
Summary

We should declare constants in headers and define them in source files.
But this commit is only aimed at compound types.

I don't know if it is necessary to do the same thing to fundamental types.

I used this command to find all of the constant definitions in header files.

`find . -name "*.h" | xargs grep -e "^const .*=.*"`

And here is what I found:

```
./db/version_edit.h:const uint64_t kFileNumberMask = 0x3FFFFFFFFFFFFFFF;
./include/rocksdb/env.h:const size_t kDefaultPageSize = 4 * 1024;
./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = {
./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = {
./include/rocksdb/table.h:const uint32_t kPlainTableVariableLength = 0;
./include/rocksdb/utilities/transaction_db.h:const uint32_t kInitialMaxDeadlocks = 5;
./port/port_posix.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max();
./port/port_posix.h:const int kMaxInt32 = std::numeric_limits<int32_t>::max();
./port/port_posix.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max();
./port/port_posix.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max();
./port/port_posix.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max();
./port/win/port_win.h:const uint32_t kMaxUint32 = UINT32_MAX;
./port/win/port_win.h:const int kMaxInt32 = INT32_MAX;
./port/win/port_win.h:const int64_t kMaxInt64 = INT64_MAX;
./port/win/port_win.h:const uint64_t kMaxUint64 = UINT64_MAX;
./port/win/port_win.h:const size_t kMaxSizet = UINT64_MAX;
./port/win/port_win.h:const size_t kMaxSizet = UINT_MAX;
./port/win/port_win.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max();
./port/win/port_win.h:const int kMaxInt32 = std::numeric_limits<int>::max();
./port/win/port_win.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max();
./port/win/port_win.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max();
./port/win/port_win.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max();
./port/win/port_win.h:const bool kLittleEndian = true;
./table/cuckoo_table_factory.h:const uint32_t kCuckooMurmurSeedMultiplier = 816922183;
./table/data_block_hash_index.h:const uint8_t kNoEntry = 255;
./table/data_block_hash_index.h:const uint8_t kCollision = 254;
./table/data_block_hash_index.h:const uint8_t kMaxRestartSupportedByHashIndex = 253;
./table/data_block_hash_index.h:const size_t kMaxBlockSizeSupportedByHashIndex = 1u << 16;
./table/data_block_hash_index.h:const double kDefaultUtilRatio = 0.75;
./table/filter_block.h:const uint64_t kNotValid = ULLONG_MAX;
./table/format.h:const int kMagicNumberLengthByte = 8;
./third-party/fbson/FbsonJsonParser.h:const char* const kJsonDelim = " ,]}\t\r\n";
./third-party/fbson/FbsonJsonParser.h:const char* const kWhiteSpace = " \t\n\r";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const BiggestInt kMaxBiggestInt =
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestStyleFlag[] = "death_test_style";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestUseFork[] = "death_test_use_fork";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kInternalRunDeathTestFlag[] = "internal_run_death_test";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char* pets[] = {"cat", "dog"};
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const size_t kProtobufOneLinerMaxLength = 50;
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const int kMaxStackTraceDepth = 100;
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const T* WithParamInterface<T>::parameter_ = NULL;
./util/coding.h:const unsigned int kMaxVarint64Length = 10;
./util/filename.h:const size_t kFormatFileNumberBufSize = 38;
./util/testutil.h:const SliceTransform* RandomSliceTransform(Random* rnd, int pre_defined = -1);
./util/trace_replay.h:const std::string kTraceMagic = "feedcafedeadbeef";
./util/trace_replay.h:const unsigned int kTraceTimestampSize = 8;
./util/trace_replay.h:const unsigned int kTraceTypeSize = 1;
./util/trace_replay.h:const unsigned int kTracePayloadLengthSize = 4;
./util/trace_replay.h:const unsigned int kTraceMetadataSize =
./utilities/cassandra/serialize.h:const int64_t kCharMask = 0xFFLL;
./utilities/cassandra/serialize.h:const int32_t kBitsPerByte = 8;
```

And these 3 lines are related to this commit:

```
./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = {
./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = {
./util/trace_replay.h:const std::string kTraceMagic = "feedcafedeadbeef";
```

Any comments would be appreciated.
Thanks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4691

Differential Revision: D13208049

Pulled By: ajkr

fbshipit-source-id: e5ee55fdaec5447fc5798c6721e2821e7cdc0d5b
2018-11-26 21:32:03 -08:00
Zhongyi Xie
a21cb22ee3 Revert "apply ReadOptions.iterate_upper_bound to transaction iterator… (#4705)
Summary:
… (#4656)"

This reverts commit b76398a82b.

Will add test coverage for iterate_upper_bound before re-commit b76398
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4705

Differential Revision: D13148592

Pulled By: miasantreble

fbshipit-source-id: 4d1ce0bfd9f7a5359a7688bd780eb06a66f45b1f
2018-11-24 10:46:28 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Andrew Kryczka
a138e351bc Fix compatibility of public ticker stats (#4701)
Summary:
- Added back the `NO_ITERATORS` that was removed in 5945e16dfc.
- Marked it as deprecated since it is no longer populated, but kept for API compatibility.
- Made sure the new tickers, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETED`, are appended at the end of the enum, in case people are relying on the int values.

The change where `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETED` were introduced is unreleased so I believe it is ok to change their ordering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4701

Differential Revision: D13142887

Pulled By: ajkr

fbshipit-source-id: 29a336ce5b46632ce50ad42ccc4a29013f71d6d6
2018-11-20 13:13:16 -08:00
Siying Dong
13579e8c5a WriteBufferManger doens't cost to cache if no limit is set (#4695)
Summary:
WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695

Differential Revision: D13112722

Pulled By: siying

fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
2018-11-18 16:55:43 -08:00
Andrew Kryczka
9d6d4867ab Fix uninitialized fields in file metadata (#4693)
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693

Differential Revision: D13113189

Pulled By: ajkr

fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
2018-11-16 20:49:17 -08:00
Maysam Yabandeh
c2a20f1776 Fix ignoring params in default impl of GetForUpdate (#4679)
Summary:
The default implementation of GetForUpdate that receives PinnableSlice was mistakenly dropping column_family and exclusive parameters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4679

Differential Revision: D13062531

Pulled By: maysamyabandeh

fbshipit-source-id: 7625d0c1ba872a5d894b58ced42147d6c8556a6f
2018-11-14 11:29:38 -08:00
Siying Dong
b82e57d425 Remove two variables from BlockContents class and don't use class Block for compressed block (#4650)
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.

The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.

Some code is slightly refactored so that the cachable information can flow better.

Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650

Differential Revision: D12969070

Pulled By: siying

fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
2018-11-13 17:02:55 -08:00
Zhongyi Xie
b76398a82b apply ReadOptions.iterate_upper_bound to transaction iterator (#4656)
Summary:
Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656

Differential Revision: D13039257

Pulled By: miasantreble

fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1
2018-11-13 15:44:15 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
QingpingWang
4f0fcb78ae Expose num entries and deletions of sst files (#4623)
Summary:
he ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level.
Also refer #3980
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623

Differential Revision: D13045744

Pulled By: sagar0

fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3
2018-11-13 11:52:19 -08:00
Soli Como
5945e16dfc Divide NO_ITERATORS into two counters NO_ITERATOR_CREATED and NO_ITERATOR_DELETE (#4498)
Summary:
Currently, `Statistics` can record tick by `recordTick()` whose second parameter is an `uint64_t`.
That means tick can only increase.
If we want to reduce tick, we have to work around like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
That's kind of a hack.

So, this PR divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only.

Fixes #3013 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498

Differential Revision: D10395010

Pulled By: sagar0

fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
2018-11-13 11:46:32 -08:00
Zhongyi Xie
b313019326 use per-level perfcontext for DB::Get calls (#4617)
Summary:
this PR adds two more per-level perf context counters to track
* number of keys returned in Get call, break down by levels
* total processing time at each level during Get call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617

Differential Revision: D12898024

Pulled By: miasantreble

fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
2018-11-13 10:40:49 -08:00
Fosco Marotto
050d73551b Update history and version for future 5.18 release.
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4669

Differential Revision: D13031522

Pulled By: gfosco

fbshipit-source-id: d09e655a9d5556594f195c5d1b786900932145ce
2018-11-12 14:45:47 -08:00
Sagar Vemuri
0c4678fd1b Update documentation about dynamic options (#4653)
Summary:
Updated the comments around all options which are currently dynamic.
Without explicitly specifying in the comments around the options, its hard for RocksDB users to know if an option is dynamically changeable or not unless they dig into the implementation details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4653

Differential Revision: D12966735

Pulled By: sagar0

fbshipit-source-id: 1a01acbf6fe506b989e71629ce223f9803ebae27
2018-11-09 13:30:49 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Andrew Kryczka
fffac43cfb Add DB property for SST files kept from deletion (#4618)
Summary:
This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618

Differential Revision: D12898179

Pulled By: ajkr

fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
2018-11-05 20:24:40 -08:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Abhishek Madan
eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Yi Wu
5f5fddabc7 port folly::JemallocNodumpAllocator (#4534)
Summary:
Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
1. It only support jemalloc >= 5.0
2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.

Depending on #4502.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534

Differential Revision: D10435474

Pulled By: yiwu-arbug

fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
2018-10-26 17:29:18 -07:00
Yanqin Jin
5b4c709fad Enable atomic flush (#4023)
Summary:
Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023

Differential Revision: D8518381

Pulled By: riversand963

fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f
2018-10-26 15:08:43 -07:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Sagar Vemuri
abb8ecb4cd Add missing methods to WritableFileWrapper (#4584)
Summary:
`WritableFileWrapper` was missing some newer methods that were added to `WritableFile`. Without these functions, the missing wrapper methods would fallback to using the default implementations in WritableFile instead of using the corresponding implementations in, say, `PosixWritableFile` or `WinWritableFile`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4584

Differential Revision: D10559199

Pulled By: sagar0

fbshipit-source-id: 0d0f18a486aee727d5b8eebd3110a41988e27391
2018-10-24 12:19:54 -07:00
jsteemann
d1c0d3f358 Small issues (#4564)
Summary:
Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564

Differential Revision: D10510183

Pulled By: maysamyabandeh

fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b
2018-10-23 10:35:57 -07:00
Siying Dong
7024263682 Dynamic level to adjust level multiplier when write is too heavy (#4338)
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.

We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338

Differential Revision: D9636782

Pulled By: siying

fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
2018-10-22 10:21:47 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
Yanqin Jin
ce52274640 Replace 'string' with 'const string&' in FileOperationInfo (#4491)
Summary:
Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying  on #3933.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491

Differential Revision: D10381211

Pulled By: riversand963

fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31
2018-10-15 13:46:01 -07:00
Yanqin Jin
729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
Peter Pei
09814f2cfc support OnCompactionBegin (#4431)
Summary:
fix #4288

Add `OnCompactionBegin` support to `rocksdb::EventListener`.

Currently, we only have these three callbacks:

- OnFlushBegin
- OnFlushCompleted
- OnCompactionCompleted

As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`.

This PR is a try to implement the support of `OnCompactionBegin`.

Hope it is useful to you.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431

Differential Revision: D10055515

Pulled By: yiwu-arbug

fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
2018-10-10 17:32:27 -07:00
Fosco Marotto
35f26beca5 Update version macro for 5.17 (#4472)
Summary:
Forgot this in previous commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4472

Differential Revision: D10244227

Pulled By: gfosco

fbshipit-source-id: ba0cf7a2f5271f0d9f9443004e2620887cd5fd11
2018-10-08 16:22:17 -07:00
DorianZheng
e0f05754ba Expose column family id to OnCompactionCompleted (#4466)
Summary:
The controller you requested could not be found. PTAL
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466

Differential Revision: D10241358

Pulled By: yiwu-arbug

fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e
2018-10-08 14:24:16 -07:00
Andrew Gallagher
897fe6a4a3 rocksdb: put #pragma once before #ifdef
Summary: Work around upstream bug with modules: https://bugs.llvm.org/show_bug.cgi?id=39184.

Reviewed By: yiwu-arbug

Differential Revision: D10209569

fbshipit-source-id: 696853a02a3869e9c33d0e61168ad4b0436fa3c0
2018-10-04 17:10:21 -07:00
Zhongyi Xie
ce1fc5af09 fix unused param allocator in compression.h (#4453)
Summary:
this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
Igor Canadi
1cf5deb8fd Introduce CacheAllocator, a custom allocator for cache blocks (#4437)
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.

We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.

Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437

Differential Revision: D10132814

Pulled By: yiwu-arbug

fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
2018-10-02 17:24:58 -07:00
Andrew Kryczka
ac6f435a9a Fix CompactFiles support for kDisableCompressionOption (#4438)
Summary:
Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438

Differential Revision: D10115761

Pulled By: ajkr

fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122
2018-10-01 01:18:10 -07:00
Anand Ananthabhotla
72712f4e28 Allow dynamic modification of window size and deletion trigger (#4403)
Summary:
Make the CompactOnDeletionCollectorFactory class public, and provide
methods to update the window size and deletion trigger params. These
will take effect on subsequent created SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403

Differential Revision: D9976857

Pulled By: anand1976

fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48
2018-09-20 15:15:28 -07:00
Anand Ananthabhotla
a27fce408e Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
  a) On the first occurance of an out of space error during compaction,
subsequent
  compactions will be delayed until the disk free space check indicates
  enough available space. The required space is computed as the sum of
  input sizes.
  b) The free space check requirement will be removed once the amount of
  free space is greater than the size reserved by in progress
  compactions when the first error occured
  c) If the out of space error is a hard error, a background thread in
  SFM will poll for sufficient headroom before triggering the recovery
  of the database and putting it in write-only mode. The headroom is
  calculated as the sum of the write_buffer_size of all the DB instances
  associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()

Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164

Differential Revision: D9846378

Pulled By: anand1976

fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 13:43:04 -07:00
Vitaly Isaev
0bd2ede10e Memory usage stats in C API (#4340)
Summary:
Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340

Differential Revision: D9655135

Pulled By: ajkr

fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a
2018-09-13 14:27:31 -07:00
Maysam Yabandeh
3f5282268f Skip concurrency control during recovery of pessimistic txn (#4346)
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346

Differential Revision: D9759149

Pulled By: maysamyabandeh

fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
2018-09-10 16:57:53 -07:00
Kefu Chai
faf529fd7c env_librados.h: drop redundant #endif (#4354)
Summary:
without this change, rocksdb_env_librados_test fails to build.

it's a regression introduced by 64324e32

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4354

Differential Revision: D9702665

Pulled By: riversand963

fbshipit-source-id: 65134eaff0543733210edfc77f89c96709da7a3f
2018-09-07 11:12:44 -07:00
Maysam Yabandeh
655ef7d77f Inline doc for format_version 4 (#4350)
Summary:
Fixes #4337
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4350

Differential Revision: D9700871

Pulled By: maysamyabandeh

fbshipit-source-id: fe1e07803783f34588dc14aba66d51117ca4a180
2018-09-07 07:57:30 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Yi Wu
462ed70d64 BlobDB: GetLiveFiles and GetLiveFilesMetadata return relative path (#4326)
Summary:
`GetLiveFiles` and `GetLiveFilesMetadata` should return path relative to db path.

It is a separate issue when `path_relative` is false how can we return relative path. But `DBImpl::GetLiveFiles` don't handle it as well when there are multiple `db_paths`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4326

Differential Revision: D9545904

Pulled By: yiwu-arbug

fbshipit-source-id: 6762d879fcb561df2b612e6fdfb4a6b51db03f5d
2018-08-31 12:12:49 -07:00
Mikhail Antonov
927f274939 Avoiding write stall caused by manual flushes (#4297)
Summary:
Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening.

One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable.

This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297

Differential Revision: D9420705

Pulled By: mikhail-antonov

fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
2018-08-29 12:12:55 -07:00
Gauresh Rane
ad789e4e0d Adding a method for memtable class for memtable getting flushed. (#4304)
Summary:
Memtables are selected for flushing by the flush job. Currently we
have listener which is invoked when memtables for a column family are
flushed. That listener does not indicate which memtable was flushed in
the notification. If clients want to know if particular data in the
memtable was retired, there is no straight forward way to know this.
This method will help users who implement memtablerep factory and extend
interface for memtablerep, to know if the data in the memtable was
retired.
Another option that was tried, was to depend on memtable destructor to
be called after flush to mark that data was persisted. This works all
the time but sometimes there can huge delays between actual flush
happening and memtable getting destroyed. Hence, if anyone who is
waiting for data to persist will have to wait that longer.
It is expected that anyone who is implementing this method to have
return quickly as it blocks RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304

Reviewed By: riversand963

Differential Revision: D9472312

Pulled By: gdrane

fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d
2018-08-23 17:14:25 -07:00
Yanqin Jin
bb5dcea98e Add path to WritableFileWriter. (#4039)
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039

Differential Revision: D8670178

Pulled By: riversand963

fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
2018-08-23 10:12:58 -07:00
jsteemann
90f744941d adds missing PopSavePoint method to Transaction (#4256)
Summary:
Transaction has had methods to deal with SavePoints already, but
was missing the PopSavePoint method provided by WriteBatch and
WriteBatchWithIndex.
This PR adds PopSavePoint to Transaction as well. Having the method
on Transaction-level too is useful for applications that repeatedly
execute a sequence of operations that normally succeed, but infrequently
need to get rolled back. Using SavePoints here is sensible, but as
operations normally succeed the application may pile up a lot of
useless SavePoints inside a Transaction, leading to slightly increased
memory usage for managing the unneeded SavePoints.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4256

Differential Revision: D9326932

Pulled By: yiwu-arbug

fbshipit-source-id: 53a0af18a6c7e87feff8a56f1f3eab9df7f371d6
2018-08-17 11:57:30 -07:00
Fenggang Wu
9d646a6311 Add db_bench options of data block hash index (#4281)
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.

`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281

Differential Revision: D9361476

Pulled By: fgwu

fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
2018-08-16 18:42:46 -07:00
Siying Dong
9c0c8f5ff6 GetAllKeyVersions() to take an extra argument of max_num_ikeys. (#4271)
Summary:
Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271

Differential Revision: D9369149

Pulled By: siying

fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf
2018-08-16 15:57:08 -07:00
Andrey Zagrebin
aeed4f0749 #3865 fix performance regression introduced by MergeOperator.ShouldMerge (#4266)
Summary:
This PR addresses issue #3865 and implements the following approach to fix it:
 - adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order
 - `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward`
 - pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266

Differential Revision: D9360750

Pulled By: sagar0

fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092
2018-08-16 10:58:05 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Huachao Huang
d916a1105a c-api: add some missing options
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4267

Differential Revision: D9309505

Pulled By: anand1976

fbshipit-source-id: eb9fee8037f4ff24dc1cdd5cc5ef41c231a03e1f
2018-08-13 18:42:30 -07:00
Maysam Yabandeh
caf0f53a74 Index value delta encoding (#3983)
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
2018-08-09 16:58:40 -07:00
Georgios Bitzes
1b813a9b2e Make rocksdb::Slice more interoperable with std::string_view (#4242)
Summary:
This change allows using std::string_view objects directly in the
DB API:

    db->Get(some_string_view_object, ...);

The conversion from std::string_view to rocksdb::Slice is done
automatically, thanks to the added constructor.

I'm stopping short of adding an implicit conversion operator
from rocksdb::Slice to std::string_view, as I don't think that's
a good idea for PinnableSlices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4242

Differential Revision: D9224134

Pulled By: anand1976

fbshipit-source-id: f50aad04dd0b01737907c0fb88d495c83a81f4e4
2018-08-09 14:43:34 -07:00
Yanqin Jin
de7f423a82 Add SST ingestion to ldb (#4205)
Summary:
We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205

Differential Revision: D9112711

Pulled By: riversand963

fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4
2018-08-09 14:29:11 -07:00
Huachao Huang
badfd70a3e types: add kEntryBlobIndex for TablePropertiesCollector (#4233)
Summary:
So that we can act accordingly on blob index entries
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233

Differential Revision: D9190205

Pulled By: yiwu-arbug

fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c
2018-08-06 18:27:44 -07:00
Gustav Davidsson
a15354d04e Expose GetTotalTrashSize in SstFileManager interface (#4206)
Summary:
Hi, it would be great if we could expose this API, so that LogDevice can use it to track the total size of trash files and alarm if it grows too large in relation to disk size. There's probably other customers that would be interested in this as well. :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4206

Differential Revision: D9115516

Pulled By: gdavidsson

fbshipit-source-id: f34993a940e39cb0a0b544ae8298546499b7e047
2018-08-04 17:57:48 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Fenggang Wu
ee7617167f DataBlockHashIndex: Specify that DataBlockHashIndex is not yet implemented in the comment
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4203

Differential Revision: D9090912

Pulled By: fgwu

fbshipit-source-id: 6a68be83693ddf2a5c060290382141f0d2fb400b
2018-07-31 11:43:08 -07:00
Yanqin Jin
54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
Fenggang Wu
a11df583ec Add DataBlockIndexType option in BlockBasedTableOptions (#4150)
Summary:
Added DataBlockIndexType option in BlockBasedTableOptions.
```
enum DataBlockIndexType : char {
    kDataBlockBinarySearch = 0, // traditional block type
    kDataBlockHashIndex = 1, // additional hash index appended to the end.
};
```
The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150

Differential Revision: D8895958

Pulled By: fgwu

fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
2018-07-27 15:42:27 -07:00
Maysam Yabandeh
c33b32671e Correct description of GetColumnFamilyMetaData (#4196)
Summary:
The inline doc was incorrectly mentioned a return status while the function does not return a value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4196

Differential Revision: D9030927

Pulled By: maysamyabandeh

fbshipit-source-id: 07c34dc6bf521021bf790ac1bfedb676171129ec
2018-07-27 11:42:37 -07:00
Maysam Yabandeh
e0906eb785 Clarify max_total_wal_size's scope (#4194)
Summary:
max_total_wal_size takes effect only when there are more than one column families. The patch clarify that in the inline docs

Closes https://github.com/facebook/rocksdb/issues/4180
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4194

Differential Revision: D9028767

Pulled By: maysamyabandeh

fbshipit-source-id: 8d730ca7f15e76e7ee9ff88b2b48030b2d1b7078
2018-07-27 09:29:44 -07:00
Yanqin Jin
18f538038a Increase version number to 5.16 (#4176)
Summary:
Given that we have cut 5.15, we should bump the version number to the next
version, i.e. 5.16.
Also update HISTORY.md
cc sagar0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4176

Differential Revision: D8977965

Pulled By: riversand963

fbshipit-source-id: 481d75d2f446946f0eb2afb7e94ef894c8c87e1e
2018-07-24 13:43:33 -07:00
Manuel Ung
ea212e5316 WriteUnPrepared: Implement unprepared batches for transactions (#4104)
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.

Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.

A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104

Differential Revision: D8785717

Pulled By: lth

fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
2018-07-24 00:13:18 -07:00
Chang Su
374c37da5b move static msgs out of Status class (#4144)
Summary:
The member msgs of class Status contains all types of status messages.
When users dump a Status object, msgs will confuse users. So move it out
of class Status by making it as file-local static variable.

Closes #3831 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144

Differential Revision: D8941419

Pulled By: sagar0

fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d
2018-07-23 15:44:16 -07:00