rocksdb

Author	SHA1	Message	Date
Abhishek Madan	7528130e38	Cache fragmented range tombstones in BlockBasedTableReader (#4493 ) Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following: ``` readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found) ``` Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results): ``` Tombstones? \| avg micros/op \| stddev micros/op \| avg ops/s \| stddev ops/s ----------------- \| ------------- \| ---------------- \| ------------ \| ------------ None \| 0.6186 \| 0.04637 \| 1,625,252.90 \| 124,679.41 500 Expanded \| 0.6019 \| 0.03628 \| 1,666,670.40 \| 101,142.65 500 Unexpanded \| 0.6435 \| 0.03994 \| 1,559,979.40 \| 104,090.52 1k Expanded \| 0.6034 \| 0.04349 \| 1,665,128.10 \| 125,144.57 1k Unexpanded \| 0.6261 \| 0.03093 \| 1,600,457.50 \| 79,024.94 5k Expanded \| 0.6163 \| 0.05926 \| 1,636,668.80 \| 154,888.85 5k Unexpanded \| 0.6402 \| 0.04002 \| 1,567,804.70 \| 100,965.55 10k Expanded \| 0.6036 \| 0.05105 \| 1,667,237.70 \| 142,830.36 10k Unexpanded \| 0.6128 \| 0.02598 \| 1,634,633.40 \| 72,161.82 25k Expanded \| 0.6198 \| 0.04542 \| 1,620,980.50 \| 116,662.93 25k Unexpanded \| 0.5478 \| 0.0362 \| 1,833,059.10 \| 121,233.81 50k Expanded \| 0.5104 \| 0.04347 \| 1,973,107.90 \| 184,073.49 50k Unexpanded \| 0.4528 \| 0.03387 \| 2,219,034.50 \| 170,984.32 ``` After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493 Differential Revision: D10842844 Pulled By: abhimadan fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509	2018-10-25 19:26:44 -07:00
Zhongyi Xie	fe0d23059d	Fix two contrun job failures (#4587 ) Summary: Currently there are two contrun test failures: * rocksdb-contrun-lite: > tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char)’: tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’ rocksdb::DumpMallocStats(&stats_string); ^ make: * [tools/db_bench_tool.o] Error 1 * rocksdb-contrun-unity: > In file included from unity.cc:44:0: db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’: db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous auto cmp = ParsedInternalKeyComparator(icmp_); This PR will fix them Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587 Differential Revision: D10846554 Pulled By: miasantreble fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93	2018-10-24 20:16:45 -07:00
Yanqin Jin	eb8c9918f7	Remove unused variable Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4585 Differential Revision: D10841983 Pulled By: riversand963 fbshipit-source-id: 6a7e0b40065bcfbb10a2cac0cec1e8da0750a617	2018-10-24 15:51:45 -07:00
Abhishek Madan	8c78348c77	Use only "local" range tombstones during Get (#4449 ) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). ---- Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d	2018-10-24 12:31:12 -07:00
Zhongyi Xie	21bf7421ca	use per-level perf context for bloom filter related counters (#4581 ) Summary: PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels. This PR takes advantage of the feature to populate a few counters related to bloom filters Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581 Differential Revision: D10518010 Pulled By: miasantreble fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8	2018-10-24 12:21:38 -07:00
Neil Mayhew	43dbd4411e	Adapt three unit tests with newer compiler/libraries (#4562 ) Summary: This fixes three tests that fail with relatively recent tools and libraries: The tests are: * `spatial_db_test` * `table_test` * `db_universal_compaction_test` I'm using: * `gcc` 7.3.0 * `glibc` 2.27 * `snappy` 1.1.7 * `gflags` 2.2.1 * `zlib` 1.2.11 * `bzip2` 1.0.6.0.1 * `lz4` 1.8.2 * `jemalloc` 5.0.1 The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`. However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`: ``` db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms) ``` I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562 Differential Revision: D10510917 Pulled By: maysamyabandeh fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189	2018-10-24 08:17:56 -07:00
Zhongyi Xie	f6b151f16d	fix clang analyzer error (#4583 ) Summary: clang analyzer currently fails with the following warnings: > db/log_reader.cc:323:9: warning: Undefined or garbage value returned to caller return r; ^~~~~~~~ db/log_reader.cc:344:11: warning: Undefined or garbage value returned to caller return r; ^~~~~~~~ db/log_reader.cc:369:11: warning: Undefined or garbage value returned to caller return r; Pull Request resolved: https://github.com/facebook/rocksdb/pull/4583 Differential Revision: D10523517 Pulled By: miasantreble fbshipit-source-id: 0cc8b8f27657b202bead148bbe7c4aa84fed095b	2018-10-23 22:14:54 -07:00
Maysam Yabandeh	c34cc40424	Fix user comparator receiving internal key (#4575 ) Summary: There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575 Differential Revision: D10500434 Pulled By: maysamyabandeh fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366	2018-10-23 08:14:46 -07:00
Siying Dong	7024263682	Dynamic level to adjust level multiplier when write is too heavy (#4338 ) Summary: Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0. We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338 Differential Revision: D9636782 Pulled By: siying fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595	2018-10-22 10:21:47 -07:00
Yi Wu	933250e355	Fix RepeatableThreadTest::MockEnvTest hang (#4560 ) Summary: When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560 Differential Revision: D10472851 Pulled By: yiwu-arbug fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19	2018-10-21 20:17:18 -07:00
Yanqin Jin	da4aa59b4c	Add read retry support to log reader (#4394 ) Summary: Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`. Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394 Differential Revision: D9926508 Pulled By: riversand963 fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1	2018-10-19 11:53:00 -07:00
Maysam Yabandeh	0afa5b53d7	Disable GroupCommitTest in Appveyor (#4536 ) Summary: We have already disabled it on Travis since it has been too flaky. The same problem arises in Appveyor as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4536 Differential Revision: D10452240 Pulled By: maysamyabandeh fbshipit-source-id: 728f4ecddf780097159dc0a0737d460eb5ce4f09	2018-10-18 14:21:09 -07:00
Abhishek Madan	45f213b558	Lazily initialize RangeDelAggregator stripe map entries (#4497 ) Summary: When there are no range deletions, flush and compaction perform a binary search on an effectively empty map every time they call ShouldDelete. This PR lazily initializes each stripe map entry so that the binary search can be elided in these cases. After this PR, the total amount of time spent in compactions is 52.541331s, and the total amount of time spent in flush is 5.532608s, the former of which is a significant improvement from the results after #4495. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4497 Differential Revision: D10428610 Pulled By: abhimadan fbshipit-source-id: 6f7e1ce3698fac3ef86d1197955e6b72e0931a0f	2018-10-17 11:47:34 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as tuning bloom filter Also replaced __thread with thread_local keyword for perf_context Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
anand1976	1e3845805d	Properly determine a truncated CompactRange stop key (#4496 ) Summary: When a CompactRange() call for a level is truncated before the end key is reached, because it exceeds max_compaction_bytes, we need to properly set the compaction_end parameter to indicate the stop key. The next CompactRange will use that as the begin key. We set it to the smallest key of the next file in the level after expanding inputs to get a clean cut. Previously, we were setting it before expanding inputs. So we could end up recompacting some files. In a pathological case, where a single key has many entries spanning all the files in the level (possibly due to merge operands without a partial merge operator, thus resulting in compaction output identical to the input), this would result in an endless loop over the same set of files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496 Differential Revision: D10395026 Pulled By: anand1976 fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9	2018-10-15 23:22:51 -07:00
Yanqin Jin	e633983cf1	Add support to flush multiple CFs atomically (#4262 ) Summary: Leverage existing `FlushJob` to implement atomic flush of multiple column families. This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262 Differential Revision: D9283109 Pulled By: riversand963 fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed	2018-10-15 20:01:17 -07:00
Andrew Kryczka	32b4d4ad47	Avoid per-key linear scan over snapshots in compaction (#4495 ) Summary: `CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495 Differential Revision: D10386470 Pulled By: ajkr fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b	2018-10-15 16:21:22 -07:00
Yanqin Jin	729a617b5b	Add listener to sample file io (#3933 ) Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15	2018-10-12 18:36:11 -07:00
Yi Wu	6f8d4bdff1	Fix compile error with jemalloc (#4488 ) Summary: The "je_" prefix of jemalloc APIs presents only when the macro `JEMALLOC_NO_RENAME` from jemalloc.h presents. With the patch I'm also adding -DROCKSDB_JEMALLOC flag in buck TARGETS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4488 Differential Revision: D10355971 Pulled By: yiwu-arbug fbshipit-source-id: 03a2d69790a44ac89219c7525763fa937a63d95a	2018-10-12 11:50:50 -07:00
Chinmay Kamat	6422356a27	Acquire lock on DB LOCK file before starting repair. (#4435 ) Summary: This commit adds code to acquire lock on the DB LOCK file before starting the repair process. This will prevent multiple processes from performing repair on the same DB simultaneously. Fixes repair_test to work with this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4435 Differential Revision: D10361499 Pulled By: riversand963 fbshipit-source-id: 3c512c48b7193d383b2279ccecabdb660ac1cf22	2018-10-12 10:41:54 -07:00
Abhishek Madan	7dd1641048	Use vector in UncollapsedRangeDelMap (#4487 ) Summary: Using `./range_del_aggregator_bench --use_collapsed=false --num_range_tombstones=5000 --num_runs=1000`, here are the results before and after this change: Before: ``` ========================= Results: ========================= AddTombstones: 1822.61 us ShouldDelete (first): 94.5286 us ``` After: ``` ========================= Results: ========================= AddTombstones: 199.26 us ShouldDelete (first): 38.9344 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4487 Differential Revision: D10347288 Pulled By: abhimadan fbshipit-source-id: d44efe3a166d583acfdc3ec1199e0892f34dbfb7	2018-10-11 15:29:14 -07:00
UncP	531786ebf7	DBWriteImpl: remove redundant code (#4450 ) Summary: in `WriteThread::LaunchParallelMemTableWriters`, there is ` write_group->running.store(write_group->size); ` https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450 Differential Revision: D10201900 Pulled By: yiwu-arbug fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b	2018-10-10 21:00:32 -07:00
Simon Grätzer	ceded4535d	WriteBatch::Iterate wrongly returns Status::Corruption (#4478 ) Summary: Wrong I overwrite `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error. I don't think this check is used correctly here: The counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478 Differential Revision: D10317416 Pulled By: yiwu-arbug fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61	2018-10-10 20:57:27 -07:00
Andrew Kryczka	7e56072290	Fix merge operand reappearing when covered by DeleteRange (#4481 ) Summary: Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481 Differential Revision: D10319507 Pulled By: ajkr fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f	2018-10-10 18:16:12 -07:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
Andrew Kryczka	faa70fc575	DeleteRange regression tests using public API (#4476 ) Summary: I wrote a couple tests using the public API to expose/prevent the bugs we talked. In particular, - When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key. - When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476 Differential Revision: D10286001 Pulled By: ajkr fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa	2018-10-10 12:30:11 -07:00
Abhishek Madan	9c6fea7fe1	Update HISTORY.md, fix unity_test failure (#4479 ) Summary: Follow-up to https://github.com/facebook/rocksdb/pull/4432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479 Differential Revision: D10304151 Pulled By: abhimadan fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7	2018-10-10 12:09:56 -07:00
Anand Ananthabhotla	854a4be03f	Handle mixed slowdown/no_slowdown writer properly (#4475 ) Summary: There is a bug when the write queue leader is blocked on a write delay/stop, and the queue has writers with WriteOptions::no_slowdown set to true. They are not woken up until the write stall is cleared. The fix introduces a dummy writer inserted at the tail to indicate a write stall and prevent further inserts into the queue, and a condition variable that writers who can tolerate slowdown wait on before adding themselves to the queue. The leader calls WriteThread::BeginWriteStall() to add the dummy writer and then walk the queue to fail any writers with no_slowdown set. Once the stall clears, the leader calls WriteThread::EndWriteStall() to remove the dummy writer and signal the condition variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475 Differential Revision: D10285827 Pulled By: anand1976 fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9	2018-10-09 22:52:40 -07:00
jsteemann	141ef7f8d3	avoid copying when iterating using range-based for (#4459 ) Summary: this avoids a few copies of std::string and other structs in the context of range-based for loops. instead of copying the values for each iteration, use a const reference to avoid copying. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459 Differential Revision: D10282045 Pulled By: sagar0 fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb	2018-10-09 17:15:51 -07:00
jsteemann	517d3b8b77	fix typo in error message, twice (#4457 ) Summary: Fixes a typo in error messages returned by Iterator::GetProperty(...) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457 Differential Revision: D10281965 Pulled By: sagar0 fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22	2018-10-09 17:07:27 -07:00
Abhishek Madan	3a4bd36fed	Truncate range tombstones by leveraging InternalKeys (#4432 ) Summary: To more accurately truncate range tombstones at SST boundaries, we now represent them in RangeDelAggregator using InternalKeys, which are end-key-exclusive as they were before this change. During compaction, "atomic compaction unit boundaries" (the range of keys contained in neighbouring and overlaping SSTs) are propagated down to RangeDelAggregator to truncate range tombstones at those boundariies instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683 for motivating examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432 Differential Revision: D10263952 Pulled By: abhimadan fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579	2018-10-09 15:19:38 -07:00
Zhongyi Xie	283a700f5d	add locking around calls to RecalculateWriteStallConditions in column_family_test (#4474 ) Summary: this should fix the current failing TSAN jobs: The callstack for TSAN: > WARNING: ThreadSanitizer: data race (pid=87440) Read of size 8 at 0x7d580000fce0 by thread T22 (mutexes: write M548703): #0 rocksdb::InternalStats::DumpCFStatsNoFileHistogram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1204 (column_family_test+0x00000080eca7) #1 rocksdb::InternalStats::DumpCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1169 (column_family_test+0x0000008106d0) #2 rocksdb::InternalStats::HandleCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::Slice) db/internal_stats.cc:578 (column_family_test+0x000000810720) #3 rocksdb::InternalStats::GetStringProperty(rocksdb::DBPropertyInfo const&, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:488 (column_family_test+0x00000080670c) #4 rocksdb::DBImpl::DumpStats() db/db_impl.cc:625 (column_family_test+0x00000070ce9a) > Previous write of size 8 at 0x7d580000fce0 by main thread: #0 rocksdb::InternalStats::AddCFStats(rocksdb::InternalStats::InternalCFStatsType, unsigned long) db/internal_stats.h:324 (column_family_test+0x000000693bbf) #1 rocksdb::ColumnFamilyData::RecalculateWriteStallConditions(rocksdb::MutableCFOptions const&) db/column_family.cc:818 (column_family_test+0x000000693bbf) #2 rocksdb::ColumnFamilyTest_WriteStallSingleColumnFamily_Test::TestBody() db/column_family_test.cc:2563 (column_family_test+0x0000005e5a49) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4474 Differential Revision: D10262099 Pulled By: miasantreble fbshipit-source-id: 1247973a3ca32e399b4575d3401dd5439c39efc5	2018-10-09 14:10:13 -07:00
Zhongyi Xie	cac87fcf57	move dump stats to a separate thread (#4382 ) Summary: Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all. We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior. Tested with db_bench using `--stats_dump_period_sec=20` in command line: > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG content: > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- 2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606] DB Stats Uptime(secs): 20.0 total, 20.0 interval Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s Interval stall: 00:00:0.012 H:M:S, 0.1 percent Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382 Differential Revision: D9933051 Pulled By: miasantreble fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa	2018-10-08 22:54:43 -07:00
DorianZheng	27090ae8f6	Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391 ) Summary: - Fix DBImpl API race condition The timeline of execution flow is as follow: ``` timeline user_thread1 user_thread2 t1 \| cfh = GetColumnFamilyHandleUnlocked(0) t2 \| id1 = cfh->GetID() t3 \| GetColumnFamilyHandleUnlocked(1) t4 \| id2 = cfh->GetID() V ``` The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id` - Expose ColumnFamily ID to compaction event listener - Fix the return status of `DBImpl::GetLatestSequenceForKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391 Differential Revision: D10221243 Pulled By: yiwu-arbug fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993	2018-10-08 14:24:16 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: The controller you requested could not be found. PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
DorianZheng	7487a7628c	Fix return status of DBImpl::GetLatestSequenceForKey Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4467 Differential Revision: D10241418 Pulled By: yiwu-arbug fbshipit-source-id: f6adbe7292b2c934e14971c7432b3eb115c35026	2018-10-08 14:22:05 -07:00
Maysam Yabandeh	21b51dfec4	Add inline comments to flush job (#4464 ) Summary: It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464 Differential Revision: D10224918 Pulled By: maysamyabandeh fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe	2018-10-05 15:41:17 -07:00
Maysam Yabandeh	1fb6805527	Fix snprintf buffer overflow bug (#4465 ) Summary: The contract of snprintf says that it returns "The number of characters that would have been written if n had been sufficiently large" http://www.cplusplus.com/reference/cstdio/snprintf/ The existing code however was assuming that the return value is the actual number of written bytes and uses that to reposition the starting point on the next call to snprintf. This leads to buffer overflow when the last call to snprintf has filled up the buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4465 Differential Revision: D10224080 Pulled By: maysamyabandeh fbshipit-source-id: 40f44e122d15b0db439812a0a361167cf012de3e	2018-10-05 14:50:51 -07:00
Dmitry Alimov	e13d8dcbbb	Fix typos in comments (#4456 ) Summary: Fix some typos in the comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456 Differential Revision: D10209214 Pulled By: miasantreble fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb	2018-10-04 20:46:50 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
JiYou	a1f6142f38	VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385 ) Summary: This fix is for `level == 0` in `GetOverlappingInputs()`: - In `GetOverlappingInputs()`, if `level == 0`, it has potential risk of overflow if `i == 0`. - Optmize process when `expand = true`, the expected complexity can be reduced to O(n). Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385 Differential Revision: D10181001 Pulled By: riversand963 fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5	2018-10-03 18:40:59 -07:00
Yanqin Jin	4e58b2ea3d	Check for compression lib support before test exec (#4443 ) Summary: Before running CompactFilesTest.SentinelCompressionType, we should check whether zlib and snappy are supported. CompactFilesTest.SentinelCompressionType is a newly added test. Compilation and linking with different options, e.g. COMPILE_WITH_TSAN, COMPILE_WITH_ASAN, etc. lead to generation of different binaries. On the one hand, it's not clear why zlib or snappy is present under ASAN, but not under TSAN. On the other hand, changing the compilation flags for TSAN or ASAN seems a bigger change worth much more attention. To unblock the cont-runs, I suggest that we simply add these two checks at the beginning of the test, as we did for GeneralTableTest.ApproximateOffsetOfCompressed in table/table_test.cc. Future actions include invesigating the absence of zlib and snappy when compiling with TSAN, i.e. COMPILE_WITH_TSAN=1, if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4443 Differential Revision: D10140935 Pulled By: riversand963 fbshipit-source-id: 62f96d1e685386accd2ef0b98f6f754d3fd67b3e	2018-10-02 10:42:01 -07:00
Yanqin Jin	be5cc4c7b8	Remove a race condition between lsdir and rm (#4440 ) Summary: In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is a race condition between `DBTestBase::GetSstFileCount` and `DBImpl::PurgeObsoleteFiles`. The following graph explains why. ``` Timeline db_compact_test_t bg_flush_t bg_compact_t \| [initiate bg flush and \| start waiting] \| flush \| DeleteObsoleteFiles \| [waken up by bg_flush_t which \| signaled in DeleteObsoleteFiles] \| \| [initiate compaction and \| start waiting] \| \| [compact, \| set manual.done to true] \| [signal at the end of \| BackgroundCallFlush] \| \| [waken up by bg_flush_t \| which signaled before \| returning from \| BackgroundCallFlush] \| \| Check manual.done is true \| \| GetSstFileCount <-- race condition --> PurgeObsoleteFiles V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440 Differential Revision: D10122628 Pulled By: riversand963 fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f	2018-10-01 11:57:55 -07:00
Andrew Kryczka	ac6f435a9a	Fix CompactFiles support for kDisableCompressionOption (#4438 ) Summary: Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438 Differential Revision: D10115761 Pulled By: ajkr fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122	2018-10-01 01:18:10 -07:00
JiYou	75ca13875c	FindFile: use std::lower_bound reduce the repeated code. (#4372 ) Summary: `FindFile()` and `FindFileInRange()` actually works as the same of `std::lower_bound()`. Use `std::lower_bound()` to reduce the repeated code. - change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()` Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372 Differential Revision: D9919677 Pulled By: ajkr fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de	2018-09-27 10:35:00 -07:00
Yi Wu	dc813e4b85	Improve log handling when recover without flush (#4405 ) Summary: Improve log handling when avoid_flush_during_recovery=true. 1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253. 2. truncate the last existing log, since this log can contain preallocated space and it will be a waste to keep the space. It avoids a crash loop of user application cause a lot of log with non-trivial size being created and ultimately take up all disk space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405 Differential Revision: D9953933 Pulled By: yiwu-arbug fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56	2018-09-26 10:37:48 -07:00
Nikhil Benesch	17edc82a4b	Handle tombstones at the same seqno in the CollapsedRangeDelMap (#4424 ) Summary: The CollapsedRangeDelMap was entirely mishandling tombstones at the same sequence number when the tombstones did not have identical start and end keys. Such tombstones are common since `90fc40690`, which causes tombstones to be split during compactions. For example, if the tombstone [a, c) @ 1 lies across a compaction boundary at b, it will be split into [a, b) @ 1 and [b, c) @ 1. Without this patch, the collapsed range deletion map would look like this: a -> 1 b -> 1 c -> 0 Notice how the b -> 1 entry is redundant. When the tombstones overlap, the problem is even worse. Consider tombstones [a, c) @ 1 and [b, d) @ 1, which produces this map without this patch: a -> 1 b -> 1 c -> 0 d -> 0 This map is corrupt, as a map can never contain adjacent sentinel (zero) entries. When the iterator advances from b to c, it will notice that c is a sentinel enty and skip to d--but d is also a sentinel entry! Asking what tombstone this iterator points to will trigger an assertion, as it is not pointing to a valid tombstone. /cc ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/4424 Differential Revision: D10039248 Pulled By: abhimadan fbshipit-source-id: 6d737c1e88d60e80cf27286726627ba44463e7f4	2018-09-25 14:50:31 -07:00
Abhishek Madan	3c350a7cf0	Improve RangeDelAggregator benchmarks (#4395 ) Summary: Improve time measurements for AddTombstones to only include the call and not the VectorIterator setup. Also add a new add_tombstones_per_run flag to call AddTombstones multiple times per aggregator, which will help simulate more realistic workloads. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4395 Differential Revision: D9996811 Pulled By: abhimadan fbshipit-source-id: 5865a95c323fbd9b3606493013664b4890fe5a02	2018-09-21 16:13:08 -07:00
Anand Ananthabhotla	72712f4e28	Allow dynamic modification of window size and deletion trigger (#4403 ) Summary: Make the CompactOnDeletionCollectorFactory class public, and provide methods to update the window size and deletion trigger params. These will take effect on subsequent created SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403 Differential Revision: D9976857 Pulled By: anand1976 fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48	2018-09-20 15:15:28 -07:00
Andrew Kryczka	990b52e95b	Unit test for custom comparator RangeDelAggregator (#4388 ) Summary: Add a unit test for range collapsing when non-default comparator is used. This exposes the bug fixed in #4386. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4388 Differential Revision: D9918252 Pulled By: ajkr fbshipit-source-id: 99501b96b251eab41791a7e33b27055ee36c5c39	2018-09-18 12:13:20 -07:00
jsteemann	27221b0cc2	use specified comparator in CollapsedRangeDelMap (#4386 ) Summary: The Comparator passed to CollapsedRangeDelMap was not used for operator less of the std::map `rep_` object contained in CollapsedRangeDelMap. So the map was always sorted using the default ByteWiseComparator, which seems wrong. Passing the specified Comparator through for usage in that map object fixes actual problems we were seeing with RangeDelete operations that do not delete keys as expected when using a custom Comparator. I found that the tests in current master crash when I run them locally, both with and without my patch, at the very same location. I therefore don't know if the patch breaks something else, but it seems to fix RangeDeletion issues in our product that uses RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386 Differential Revision: D9916506 Pulled By: ajkr fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66	2018-09-18 09:28:30 -07:00
Maysam Yabandeh	65ac72edd9	Fix bug in partition filters with format_version=4 (#4381 ) Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1	2018-09-17 17:28:15 -07:00
Abhishek Madan	1626f6ab6b	Add RangeDelAggregator microbenchmarks (#4363 ) Summary: To measure the results of upcoming DeleteRange v2 work, this commit adds simple benchmarks for RangeDelAggregator. It measures the average time for AddTombstones and ShouldDelete calls. Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results: Before #4014: ``` ======================= Results: ======================= AddTombstones: 1356.28 us ShouldDelete: 0.401732 us ``` Latest master: ``` ======================= Results: ======================= AddTombstones: 740.82 us ShouldDelete: 0.383271 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363 Differential Revision: D9881676 Pulled By: abhimadan fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442	2018-09-17 14:58:31 -07:00
Anand Ananthabhotla	30c21df97c	Fix regression test failures introduced by PR #4164 (#4375 ) Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640	2018-09-17 13:14:07 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Dmitri Smirnov	879998b369	Adjust c test and fix windows compilation issues Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4369 Differential Revision: D9844200 Pulled By: sagar0 fbshipit-source-id: 0d9f5f73b28234eaac55d3551ce4e2dc177af138	2018-09-14 20:57:22 -07:00
JiYou	82e8e9e26b	VersionBuilder: optmize SaveTo() to linear time. (#4366 ) Summary: Because `base_files` and `added_files` both are sorted, using a merge operation to these two sorted arrays is more effective. The complexity is reduced to linear time. - optmize the merge complexity. - move the `NDEBUG` of sorted `added_files` out of merge process. Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366 Differential Revision: D9833592 Pulled By: ajkr fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0	2018-09-14 19:43:04 -07:00
Yanqin Jin	8959063c9c	Store the return value of Fsync for check Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4361 Differential Revision: D9803723 Pulled By: riversand963 fbshipit-source-id: 5a0d4cd3e57fd195571dcd5822895ee00547fa6a	2018-09-14 13:29:56 -07:00
Yanqin Jin	82057b0d8f	Improve type conversion (#4367 ) Summary: Use `static_cast<type>(var)` instead of `(type)var`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4367 Differential Revision: D9833391 Pulled By: riversand963 fbshipit-source-id: 3d33fc2c290e7e0f3d1d45b256a881d1bc5a7df2	2018-09-14 11:12:52 -07:00
Andrew Kryczka	c94523ee56	Delete code for WAL reader to start at nonzero offset (#4362 ) Summary: The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362 Differential Revision: D9817829 Pulled By: ajkr fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d	2018-09-13 17:13:03 -07:00
Vitaly Isaev	0bd2ede10e	Memory usage stats in C API (#4340 ) Summary: Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340 Differential Revision: D9655135 Pulled By: ajkr fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a	2018-09-13 14:27:31 -07:00
Dan Melnic	ca92fc71a4	Initialize uninitialized std::atomic variables Summary: Initialize uninitialized std::atomic variables Reviewed By: yfeldblum Differential Revision: D9758050 fbshipit-source-id: 865d89eddafc81f3cab6f11e2ebb669f7ff70d04	2018-09-12 08:58:05 -07:00
Abhishek Madan	c86a22ac43	Restrict RangeDelAggregator's tombstone end-key truncation (#4356 ) Summary: `RangeDelAggregator::AddTombstones` contained an assertion which stated that, if a range tombstone extended past the largest key in the sstable, then `FileMetaData::largest` must have a sentinel sequence number of `kMaxSequenceNumber`, which implies that the tombstone's end key is safe to truncate. However, `largest` will not be a sentinel key when the next sstable in the level's smallest key is equal to the current sstable's largest key, which caused the assertion to fail. The assertion must hold for the truncation to be safe, so it has been moved to an additional check on end-key truncation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4356 Differential Revision: D9760891 Pulled By: abhimadan fbshipit-source-id: 7c20c3885cd919dcd14f291f88fd27aa33defebc	2018-09-10 17:42:43 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Anand Ananthabhotla	ced618cf39	Fix a lint error due to unspecified move evaluation order (#4348 ) Summary: In C++ 11, the order of argument and move evaluation in a statement such as below is unspecified - foo(a.b).bar(std::move(a)) The compiler is free to evaluate std::move(a) first, and then a.b is unspecified. In C++ 17, this will be safe if a draft proposal around function chaining rules is accepted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4348 Differential Revision: D9688810 Pulled By: anand1976 fbshipit-source-id: e4651d0ca03dcf007e50371a0fc72c0d1e710fb4	2018-09-06 14:42:57 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Andrew Kryczka	1a88c43751	Reduce empty SST creation/deletion in compaction (#4336 ) Summary: This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug. Also fixed an uninitialized variable in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336 Differential Revision: D9600080 Pulled By: ajkr fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7	2018-08-31 12:28:52 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable. This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Andrew Kryczka	42733637e1	Sync CURRENT file during checkpoint (#4322 ) Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it. Differential Revision: D9525939 Pulled By: ajkr fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3	2018-08-28 12:43:18 -07:00
Yanqin Jin	198459ce17	Fix an inaccurate comment (#4315 ) Summary: According to `4848bd0c4e/db/log_reader.cc (L355)`, the original text is misleading when describing the layout of RecyclableLogHeader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4315 Differential Revision: D9505284 Pulled By: riversand963 fbshipit-source-id: 79994c37a69e7003f03453e7efc0186feeafa609	2018-08-24 18:13:20 -07:00
Shrikanth Shankar	4848bd0c4e	Drop unnecessary deletion markers during compaction (issue - 3842) (#4289 ) Summary: This PR fixes issue 3842. We drop deletion markers iff 1. We are the bottom most level AND 2. All other occurrences of the key are in the same snapshot range as the delete I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far) make check -j64 db_stress db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesnt break existing tests / ./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 / to verify new test code */ Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289 Differential Revision: D9491165 Pulled By: shrikanthshankar fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2	2018-08-24 15:17:54 -07:00
Yanqin Jin	7daae512d2	Refactor flush request queueing and processing (#3952 ) Summary: RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB. This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it. This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952 Differential Revision: D8529933 Pulled By: riversand963 fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539	2018-08-24 13:27:35 -07:00
Andrew Kryczka	17f9a181d5	Reduce empty SST creation/deletion during compaction (#4311 ) Summary: I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release. This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311 Differential Revision: D9493734 Pulled By: ajkr fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8	2018-08-24 12:27:57 -07:00
Andrew Kryczka	ee234e83e3	Invoke OnTableFileCreated for empty SSTs (#4307 ) Summary: The API comment on `OnTableFileCreationStarted` (`b6280d01f9/include/rocksdb/listener.h (L331-L333)`) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data. This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307 Differential Revision: D9485201 Pulled By: ajkr fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6	2018-08-23 18:27:30 -07:00
Gauresh Rane	ad789e4e0d	Adding a method for memtable class for memtable getting flushed. (#4304 ) Summary: Memtables are selected for flushing by the flush job. Currently we have listener which is invoked when memtables for a column family are flushed. That listener does not indicate which memtable was flushed in the notification. If clients want to know if particular data in the memtable was retired, there is no straight forward way to know this. This method will help users who implement memtablerep factory and extend interface for memtablerep, to know if the data in the memtable was retired. Another option that was tried, was to depend on memtable destructor to be called after flush to mark that data was persisted. This works all the time but sometimes there can huge delays between actual flush happening and memtable getting destroyed. Hence, if anyone who is waiting for data to persist will have to wait that longer. It is expected that anyone who is implementing this method to have return quickly as it blocks RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304 Reviewed By: riversand963 Differential Revision: D9472312 Pulled By: gdrane fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d	2018-08-23 17:14:25 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Zhongyi Xie	f1f5ba085f	add missing counters in readonly mode (#4260 ) Summary: User reported (https://github.com/facebook/rocksdb/issues/4168) that when opening RocksDB in read-only mode, some statistics are not correctly reported. After some investigation, we believe the following counters are indeed not reported during Get() call in a read-only DB: rocksdb.memtable.hit rocksdb.memtable.miss rocksdb.number.keys.read rocksdb.bytes.read As well as histogram rocksdb.bytes.per.read and perf context get_read_bytes This PR will add the necessary counter reporting logic in the Get() call path Pull Request resolved: https://github.com/facebook/rocksdb/pull/4260 Differential Revision: D9476431 Pulled By: miasantreble fbshipit-source-id: 7ab409d4e59df05d09ae8b69fe75554e5aa240d6	2018-08-22 22:43:13 -07:00
Siying Dong	d5612b43de	Two code changes to make "clang analyze" happy (#4292 ) Summary: Clang analyze is not happy in two pieces of code, with "Potential memory leak". No idea what the problem but slightly changing the code makes clang happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292 Differential Revision: D9413555 Pulled By: siying fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137	2018-08-20 17:43:41 -07:00
Yanqin Jin	d116a1725d	Update recovery code for version edits group commit. (#3945 ) Summary: During recovery, RocksDB is able to handle version edits that belong to group commits. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945 Differential Revision: D8529122 Pulled By: riversand963 fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e	2018-08-20 14:58:00 -07:00
Mikhail Antonov	889a0553c8	VerifyChecksum() API should preserve options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4275 Reviewed By: yiwu-arbug Differential Revision: D9369766 Pulled By: mikhail-antonov fbshipit-source-id: d91b64c34cc1976b324a260767fce343fa32afde	2018-08-16 16:42:29 -07:00
Andrey Zagrebin	aeed4f0749	#3865 fix performance regression introduced by MergeOperator.ShouldMerge (#4266 ) Summary: This PR addresses issue #3865 and implements the following approach to fix it: - adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order - `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward` - pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266 Differential Revision: D9360750 Pulled By: sagar0 fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092	2018-08-16 10:58:05 -07:00
jsteemann	33ad9060d3	fix compilation with g++ option `-Wsuggest-override` (#4272 ) Summary: Fixes compilation warnings (which are turned into compilation errors by default) when compiling with g++ option `-Wsuggest-override`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4272 Differential Revision: D9322556 Pulled By: siying fbshipit-source-id: abd57a29ec8f544bee77c0bb438f31be830b7244	2018-08-14 15:13:10 -07:00
Huachao Huang	d916a1105a	c-api: add some missing options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4267 Differential Revision: D9309505 Pulled By: anand1976 fbshipit-source-id: eb9fee8037f4ff24dc1cdd5cc5ef41c231a03e1f	2018-08-13 18:42:30 -07:00
Siying Dong	f3d91a0b57	Add a unit test to verify iterators release data blocks after using them (#4170 ) Summary: Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170 Differential Revision: D8962513 Pulled By: siying fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407	2018-08-13 17:43:14 -07:00
Anand Ananthabhotla	4ea56b1bd0	Revert changes in PR #4003 (#4263 ) Summary: Revert this change. Not generating the OnTableFileCreated() notification for a 0 byte SST on flush breaks the assumption that every OnTableFileCreationStarted() notification is followed by a corresponding OnTableFileCreated(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4263 Differential Revision: D9285623 Pulled By: anand1976 fbshipit-source-id: 808c3dcd498b4b4f4ed4be947a29a24b2296aa8d	2018-08-11 16:57:36 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Zhichao Cao	76d77205da	Remove the redundant condition inclusion to avoid confusion (#4254 ) Summary: The pair of ROCKSDB_LITE condition inclusion is redundant, it is already inside the #ifndef ROCKSDB_LITE. Remove them to void confusion. Tested by make asan_check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4254 Differential Revision: D9281652 Pulled By: zhichao-cao fbshipit-source-id: 06bf7641ede71391f21f6a3fe37fbd13f0e2a43a	2018-08-10 17:43:33 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Yanqin Jin	de7f423a82	Add SST ingestion to ldb (#4205 ) Summary: We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205 Differential Revision: D9112711 Pulled By: riversand963 fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4	2018-08-09 14:29:11 -07:00
Zhongyi Xie	b15379dcea	fix use-after-free error involving a temporary string (#4240 ) Summary: In the current code, `error_msg` is pointing to the inner buffer of a temporary std::string object. When `error_msg` is used to construct the error message, that array is already released. This PR will fix this bug by copying the string to a local variable. Fixes https://github.com/facebook/rocksdb/issues/4239 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4240 Differential Revision: D9204334 Pulled By: miasantreble fbshipit-source-id: 0ac599e166ae0a4ec413e32d8b8853d7c5fba878	2018-08-09 11:13:10 -07:00
Maysam Yabandeh	d8d66c937e	Simplify DBWithMaxSpaceAllowedRandomized (#4235 ) Summary: The test has become complicated over the years and hard to reason about the corner cases that makes the test flaky. The patch simplifies the test and also fixes some probable synchronization issues. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4235 Differential Revision: D9187995 Pulled By: maysamyabandeh fbshipit-source-id: 53c7b060f14367e5a9e361014578c26debfe3d27	2018-08-08 07:27:46 -07:00
Huachao Huang	badfd70a3e	types: add kEntryBlobIndex for TablePropertiesCollector (#4233 ) Summary: So that we can act accordingly on blob index entries Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233 Differential Revision: D9190205 Pulled By: yiwu-arbug fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c	2018-08-06 18:27:44 -07:00
Yi Wu	4cb7068c1e	BlobDB: Fix VisibleToActiveSnapshot() (#4236 ) Summary: There are two issues with `VisibleToActiveSnapshot`: 1. If there are no snapshots, `oldest_snapshot` will be 0 and `VisibleToActiveSnapshot` will always return true. Since the method is used to decide whether it is safe to delete obsolete files, obsolete file won't be able to delete in this case. 2. The `auto` keyword of `auto snapshots = db_impl_->snapshots()` translate to a copy of `const SnapshotList` instead of a reference. Since copy constructor of `SnapshotList` is not defined, using the copy may yield unexpected result. Issue 2 actually hide issue 1 from being catch by tests. During test `snapshots.empty()` can return false while it should actually be empty, and `snapshots.oldest()` return an invalid address, making `oldest_snapshot` being some random large number. The issue was originally reported by BlobDB early adopter at Kuaishou. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4236 Differential Revision: D9188706 Pulled By: yiwu-arbug fbshipit-source-id: a0f2624b927cf9bf28c1bb534784fee5d106f5ea	2018-08-06 16:57:42 -07:00
Jingguo Yao	ceb5fea1e3	Improve FullFilterBitsReader::HashMayMatch's doc (#4202 ) Summary: HashMayMatch is related to AddKey() instead of CreateFilter(). Also applies some minor Fixes #4191 #4200 #3910 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202 Differential Revision: D9180945 Pulled By: maysamyabandeh fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6	2018-08-06 11:13:18 -07:00
Yanqin Jin	1f802773bc	Update JobContext. (#3949 ) Summary: In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`. Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature. Add a vector of `SuperVersionContext` to `JobContext` to support installing superversions for multiple column families in one job context. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949 Differential Revision: D8864895 Pulled By: riversand963 fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca	2018-08-03 17:42:34 -07:00
Yanqin Jin	22368965a0	Modify verification logic of ObsoleteOptionsFileTest (#4218 ) Summary: The current verification logic does not consider the case in which multiple threads (foreground and background) may execute `PurgeObsoleteFiles` function simultaneously. Each invocation will trigger the callback adding elements to a vector. Then we verify the elements in the vector, which can fail sometimes. The solution is to give up checking the elements. Instead, we check the number of OPTIONS file in the database dir. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4218 Differential Revision: D9128727 Pulled By: riversand963 fbshipit-source-id: 2b13b705fb21bc0ddd41940c4ec9b6b0c8d88224	2018-08-03 13:57:40 -07:00
DorianZheng	f9373e2d5c	Make sure to call ReleaseFileNumberFromPendingOutputs Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4219 Differential Revision: D9144294 Pulled By: riversand963 fbshipit-source-id: e46b72e5f8a149dc7a0512e38edcd0ddb0150f30	2018-08-02 18:57:34 -07:00
Andrew Kryczka	f8f6983f89	Skip range deletions at seqno zero when collapsing (#4216 ) Summary: `CollapsedRangeDelMap` internally uses seqno zero as a sentinel value to denote a gap between range tombstones or the end of range tombstones. It therefore expects to never have consecutive sentinel tombstones. However, since `DeleteRange` is now supported in `SstFileWriter`, an ingested file may contain range tombstones, and that ingested file may be assigned global seqno zero. When such tombstones are added to the collapsed map, they resemble sentinel tombstones due to having seqno zero. Then, the invariant mentioned above about never having consecutive sentinel tombstones can be violated. The symptom of this violation was dereferencing the `end()` iterator (#4204). The fix in this PR is to not add range tombstones with seqno zero to the collapsed map. They're not needed anyways since they can't possibly cover anything (in case of a key and a range tombstone with the same seqno, the key is visible). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4216 Differential Revision: D9121716 Pulled By: ajkr fbshipit-source-id: f5b78a70bea9527354603ea7ac8542a7e2b6a210	2018-08-01 12:12:02 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
DorianZheng	f5e46354d2	Protect external file when ingesting (#4099 ) Summary: If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099 Differential Revision: D9034092 Pulled By: riversand963 fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28	2018-07-27 14:13:12 -07:00
Siying Dong	fd45495cf5	DBImpl::IngestExternalFile() should grab mutex when releasing file number in failure case (#4189 ) Summary: `995fcf7573` has a bug: ReleaseFileNumberFromPendingOutputs() added is not protected by the DB mutex. Fix it by grabbing the lock for this operation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4189 Differential Revision: D9015447 Pulled By: siying fbshipit-source-id: b8506e09a96c3f95a6fe32b5ca5fcdb9bee88937	2018-07-26 11:12:29 -07:00
Siying Dong	2a81633da2	Fix bug when seeking backward against an out-of-bound iterator (#4187 ) Summary: `92ee3350e0` introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187 Differential Revision: D8996600 Pulled By: siying fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1	2018-07-25 17:14:01 -07:00
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	2018-07-24 00:13:18 -07:00
Zhongyi Xie	f95a5b2464	Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490 ) Summary: Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker` Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490 Differential Revision: D6969154 Pulled By: miasantreble fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029	2018-07-20 16:58:13 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Yanqin Jin	2736752b33	Fix a bug in MANIFEST group commit (#4157 ) Summary: PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The implementation has a bug. When updating the log file number of each column family, we must consider only `VersionEdit`s that operate on the same column family. Otherwise, a column family may accidentally set its log file number higher than actual value, indicating that log files with smaller file number will be ignored, thus causing some updates to be lost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157 Differential Revision: D8916650 Pulled By: riversand963 fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201	2018-07-19 17:27:56 -07:00
Dmitri Smirnov	78ab11cd71	Return new operator for Status allocations for Windows (#4128 ) Summary: Windows requires new/delete for memory allocations to be overriden. Refactor to be less intrusive. Differential Revision: D8878047 Pulled By: siying fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36	2018-07-19 15:09:06 -07:00
Sagar Vemuri	f3801528c1	Disable DBFlushTest.SyncFail and DBTest.GroupCommitTest on Travis (#4154 ) Summary: I am temporarily disabling DBFlushTest.SyncFail and DBTest.GroupCommitTest tests on Travis until we figure out the root-cause. These tests will still continue to run locally though. I haven't been able to reproduce these failures locally so far (even on a [local Travis environment](https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image) ). These tests are failing way too frequently causing everyone to wonder why their PR failed on travis, and waste time in debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4154 Differential Revision: D8907258 Pulled By: sagar0 fbshipit-source-id: f40068b16e9245fb3791b6a4796435d1ce1ed205	2018-07-18 18:43:11 -07:00
Siying Dong	37e0fdc824	DBSSTTest.DeleteSchedulerMultipleDBPaths data race (#4146 ) Summary: Fix a minor data race in DBSSTTest.DeleteSchedulerMultipleDBPaths reported by TSAN Pull Request resolved: https://github.com/facebook/rocksdb/pull/4146 Differential Revision: D8880945 Pulled By: siying fbshipit-source-id: 25c632f685757735c59ad4ff26b2f346a443a446	2018-07-17 17:57:46 -07:00
Yi Wu	d538ebdff0	Fix write get stuck when pipelined write is enabled (#4143 ) Summary: Fix the issue when pipelined write is enabled, writers can get stuck indefinitely and not able to finish the write. It can show with the following example: Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last). T1: all writers pending in WAL writer queue: WAL writer queue: W1, W2, W3, W4 memtable writer queue: empty T2. W1 finish WAL writer and move to memtable writer queue: WAL writer queue: W2, W3, W4, memtable writer queue: W1 T3. W2 and W3 finish WAL write as a batch group. W2 enter ExitAsBatchGroupLeader and move the group to memtable writer queue, but before wake up next leader. WAL writer queue: W4 memtable writer queue: W1, W2, W3 T4. W1, W2, W3 finish memtable write as a batch group. Note that W2 still in the previous ExitAsBatchGroupLeader, although W1 have done memtable write for W2. WAL writer queue: W4 memtable writer queue: empty T5. The thread corresponding to W3 create another writer W3' with the same address as W3. WAL writer queue: W4, W3' memtable writer queue: empty T6. W2 continue with ExitAsBatchGroupLeader. Because the address of W3' is the same as W3, the last writer in its group, it thinks there are no pending writers, so it reset newest_writer_ to null, emptying the queue. W4 and W3' are deleted from the queue and will never be wake up. The issue exists since pipelined write was introduced in 5.5.0. Closes #3704 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143 Differential Revision: D8871599 Pulled By: yiwu-arbug fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a	2018-07-17 17:27:51 -07:00
Siying Dong	ddc07b40fc	Remove managed iterator Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124 Differential Revision: D8829910 Pulled By: siying fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560	2018-07-17 14:43:18 -07:00
Siying Dong	995fcf7573	Pending output file number should be released after bulkload failure (#4145 ) Summary: If bulkload fails for an input error, the pending output file number wasn't released. This bug can cause all future files with larger number than the current number won't be deleted, even they are compacted. This commit fixes the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4145 Differential Revision: D8877900 Pulled By: siying fbshipit-source-id: 080be92a23d43305ca1e13fe1c06eb4cd0b01466	2018-07-17 14:13:16 -07:00
Maysam Yabandeh	b55da012f6	Refactor IndexBlockIter (#4141 ) Summary: Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141 Differential Revision: D8866437 Pulled By: maysamyabandeh fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b	2018-07-16 17:13:10 -07:00
Sagar Vemuri	991120fa10	Allow ttl to be changed dynamically (#4133 ) Summary: Allow ttl to be changed dynamically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133 Differential Revision: D8845440 Pulled By: sagar0 fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3	2018-07-16 14:27:53 -07:00
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	2018-07-13 22:43:09 -07:00
Peter Mattis	90fc40690a	Relax VersionStorageInfo::GetOverlappingInputs check (#4050 ) Summary: Do not consider the range tombstone sentinel key as causing 2 adjacent sstables in a level to overlap. When a range tombstone's end key is the largest key in an sstable, the sstable's end key is so to a "sentinel" value that is the smallest key in the next sstable with a sequence number of kMaxSequenceNumber. This "sentinel" is guaranteed to not overlap in internal-key space with the next sstable. Unfortunately, GetOverlappingFiles uses user-keys to determine overlap and was thus considering 2 adjacent sstables in a level to overlap if they were separated by this sentinel key. This in turn would cause compactions to be larger than necessary. Note that this conflicts with https://github.com/facebook/rocksdb/pull/2769 and cases `DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050 Differential Revision: D8844423 Pulled By: ajkr fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b	2018-07-13 17:42:38 -07:00
Yanqin Jin	21171615c1	Reduce execution time of IngestFileWithGlobalSeqnoRandomized (#4131 ) Summary: Make `ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized` run faster. `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4131 Differential Revision: D8839952 Pulled By: riversand963 fbshipit-source-id: 4a7e842fde1cde4dc902e928a1cf511322578521	2018-07-13 17:27:39 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Fosco Marotto	8527012bb6	Converted db/merge_test.cc to use gtest (#4114 ) Summary: Picked up a task to convert this to use the gtest framework. It can't be this simple, can it? It works, but should all the std::cout be removed? ``` [$] ~/git/rocksdb [gft !]: ./merge_test [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from MergeTest [ RUN ] MergeTest.MergeDbTest Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTest (93 ms) [ RUN ] MergeTest.MergeDbTtlTest Opening database with TTL Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... Opening database with TTL a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Opening database with TTL Opening database with TTL Opening database with TTL Opening database with TTL Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTtlTest (97 ms) [----------] 2 tests from MergeTest (190 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (190 ms total) [ PASSED ] 2 tests. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114 Differential Revision: D8822886 Pulled By: gfosco fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a	2018-07-13 14:13:07 -07:00
Anand Ananthabhotla	e3eba52a5d	Re-enable kUniversalSubcompactions option_config (#4125 ) Summary: 1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so tests that cycle through all the option_configs include this 2. Skip kUniversalSubcompactions wherever kUniversalCompaction and kUniversalCompactionMultilevel are skipped Related to #3935 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125 Differential Revision: D8828637 Pulled By: anand1976 fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f	2018-07-13 11:13:01 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Yanqin Jin	90ebf1a257	Reduce execution time of a test. (#4127 ) Summary: Reduce the number of key ranges in `ExternalSSTFileTest.OverlappingRanges` so that the test completes in shorter time to avoid timeouts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4127 Differential Revision: D8827851 Pulled By: riversand963 fbshipit-source-id: a16387b0cc92a7c872b1c50f0cfbadc463afc9db	2018-07-12 17:42:03 -07:00
Yanqin Jin	dbeaa0d397	Reduce #iterations to shorten execution time. (#4123 ) Summary: Reduce #iterations from 5000 to 1000 so that `ExternalSSTFileTest.CompactDuringAddFileRandom` can finish faster. On the one hand, 5000 iterations does not seem to improve the quality of unit test in comparison with 1000. On the other hand, long running tests should belong to stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4123 Differential Revision: D8822514 Pulled By: riversand963 fbshipit-source-id: 0f439b8d5ccd9a4aed84638f8bac16382de17245	2018-07-12 14:42:39 -07:00
Nikhil Benesch	5f3088d565	Range deletion performance improvements + cleanup (#4014 ) Summary: This fixes the same performance issue that #3992 fixes but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z. With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level. ajkr please take a look when you have a minute! Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014 Differential Revision: D8773253 Pulled By: ajkr fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677	2018-07-12 14:42:39 -07:00
Nikhil Benesch	5cd8240b86	Test range deletions with more configurations (#4021 ) Summary: Run the basic range deletion tests against the standard set of configurations. This testing exposed that files with hash indexes and partitioned indexes were not handling the case where the file contained only range deletions--i.e., where the index was empty. Additionally file a TODO about the fact that range deletions are broken when allow_mmap_reads = true is set. /cc ajkr nvanbenschoten Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021 Differential Revision: D8811860 Pulled By: ajkr fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343	2018-07-11 15:57:49 -07:00
Yanqin Jin	331cb63641	SetOptions Backup Race Condition (#4108 ) Summary: Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below. ``` Time thread 1 thread 2 \| CreateNewBackup -> GetLiveFiles \| SetOptions -> RenameTempFileToOptionsFile \| SetOptions -> RenameTempFileToOptionsFile \| SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file \| copy the oldest OPTIONS // IO error! V ``` Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108 Differential Revision: D8796360 Pulled By: riversand963 fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82	2018-07-11 14:57:46 -07:00
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	2018-07-06 17:59:13 -07:00
Yanqin Jin	db7ae0a485	Fix a map lookup that may throw exception. (#4098 ) Summary: `std::map::at(key)` throws std::out_of_range if key does not exist. Current code does not handle this. Although this case is unlikely, I feel it's safe to use `std::map::find`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098 Differential Revision: D8753865 Pulled By: riversand963 fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec	2018-07-06 16:12:49 -07:00
Huachao Huang	35b83327a7	compaction: fix max_subcompactions option for CompactRange (#4082 ) Summary: The max_subcompactions option was introduced in https://github.com/facebook/rocksdb/pull/3775. Closes https://github.com/facebook/rocksdb/pull/4082 Differential Revision: D8743258 Pulled By: ajkr fbshipit-source-id: d60ee75769dfc19ab6f8754e4ff3a267848f1ed9	2018-07-05 20:12:56 -07:00
Zhongyi Xie	b3efb1cbe0	fix clang analyzer warnings (#4072 ) Summary: clang analyze is giving the following warnings: > db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null } else if (meta->smallest.size() > 0) { ^~~~~~~~~~~~~~~~~~~~~ db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta') meta->marked_for_compaction = sub_compact->builder->NeedCompact(); ~~~~ db/version_set.cc:2770:26: warning: Called C++ object pointer is null uint32_t cf_id = last_writer->cfd->GetID(); ^~~~~~~~~~~~~~~~~~~~~~~~~ Closes https://github.com/facebook/rocksdb/pull/4072 Differential Revision: D8685852 Pulled By: miasantreble fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085	2018-06-28 19:12:35 -07:00
Manuel Ung	8ad63a4b86	WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069 ) Summary: This adds a new WAL marker of type kTypeBeginUnprepareXID. Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy. Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not. Closes https://github.com/facebook/rocksdb/pull/4069 Differential Revision: D8675099 Pulled By: lth fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749	2018-06-28 18:58:29 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Yanqin Jin	26d67e357e	Support group commits of version edits (#3944 ) Summary: This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Closes https://github.com/facebook/rocksdb/pull/3944 Differential Revision: D8432536 Pulled By: riversand963 fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb	2018-06-28 12:34:39 -07:00
Maysam Yabandeh	0a5b5d88b2	Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070 ) Summary: Lite does not support readonly DBs. Closes https://github.com/facebook/rocksdb/pull/4070 Differential Revision: D8677858 Pulled By: maysamyabandeh fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa	2018-06-28 08:42:17 -07:00
Zhongyi Xie	14f409c0f1	PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067 ) Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a	2018-06-27 20:42:43 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	2018-06-26 20:27:35 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Andrew Kryczka	a8e503e545	Fix universal compaction scheduling conflict with CompactFiles (#4055 ) Summary: Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked. We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure. ``` db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed. Aborted (core dumped) ``` Closes https://github.com/facebook/rocksdb/pull/4055 Differential Revision: D8630094 Pulled By: ajkr fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8	2018-06-26 10:44:56 -07:00
Sagar Vemuri	189f0c27aa	Make BlockBasedTableIterator compaction-aware (#4048 ) Summary: Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`. In `7103559f49`, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain. Closes https://github.com/facebook/rocksdb/pull/4048 Differential Revision: D8601056 Pulled By: sagar0 fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3	2018-06-25 13:19:27 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Zhongyi Xie	795e663df0	option for timing measurement of non-blocking ops during compaction (#4029 ) Summary: For example calling CompactionFilter is always timed and gives the user no way to disable. This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers` Closes https://github.com/facebook/rocksdb/pull/4029 Differential Revision: D8583670 Pulled By: miasantreble fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f	2018-06-21 21:28:05 -07:00
Yanqin Jin	524c6e6b72	Add file name info to SequentialFileReader. (#4026 ) Summary: We potentially need this information for tracing, profiling and diagnosis. Closes https://github.com/facebook/rocksdb/pull/4026 Differential Revision: D8555214 Pulled By: riversand963 fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2	2018-06-21 08:42:24 -07:00
Siying Dong	92ee3350e0	BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004 ) Summary: `b555ed30a4` makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81	2018-06-19 09:57:11 -07:00
Tomas Kolda	c766887458	Fix ExternalSSTFileTest::OverlappingRanges test on Solaris Sparc (#4012 ) Summary: Fix of #4011 Closes https://github.com/facebook/rocksdb/pull/4012 Differential Revision: D8499173 Pulled By: sagar0 fbshipit-source-id: cbb2b90c544ed364a3640ea65835d577b2dbc5df	2018-06-18 14:57:37 -07:00
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	2018-06-15 19:26:58 -07:00
Anand Ananthabhotla	c48764ba47	Don't generate a notification for a 0 size SST (#4003 ) Summary: Don't call the OnTableFileCreated listener callback when a 0 size SST file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification. Closes https://github.com/facebook/rocksdb/pull/4003 Differential Revision: D8461385 Pulled By: anand1976 fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d	2018-06-15 17:57:24 -07:00
zhichao-cao	3fbc865cd5	Add kOptionsStatistics to GetProperty() (#3966 ) Summary: Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass. Closes https://github.com/facebook/rocksdb/pull/3966 Differential Revision: D8311139 Pulled By: zhichao-cao fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac	2018-06-15 17:28:01 -07:00
奏之章	f23fed19a1	Delay verify compaction output table (#3979 ) Summary: Verify table will load SST into `TableCache` it occupy memory & `TableCache`‘s capacity ... but no logic use them it's unnecessary ... so , we verify them after all sub compact finished Closes https://github.com/facebook/rocksdb/pull/3979 Differential Revision: D8389946 Pulled By: ajkr fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49	2018-06-15 12:42:53 -07:00
Fenggang Wu	fbe3b9e2b6	Udpate db_universal_compaction_test according to PR #3970 (#3995 ) Summary: The SST file sizes changed slightly after the improvement of PR #3970 which reduces the size of the properties block. Before PR #3970 a size ratio compaction included all of the first four flushed files but it only includes two files after. We increase the size_ratio universal compaction option to make that compaction include all four files again. Closes https://github.com/facebook/rocksdb/pull/3995 Differential Revision: D8426925 Pulled By: fgwu fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62	2018-06-15 10:42:21 -07:00
Siying Dong	d82f1421b4	Fix regression bug of Prev() with upper bound (#3989 ) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4	2018-06-12 16:57:36 -07:00
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	2018-06-06 16:58:16 -07:00
Andrew Kryczka	4420df4b0e	Check conflict at output level in CompactFiles (#3926 ) Summary: CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level. Closes https://github.com/facebook/rocksdb/pull/3926 Differential Revision: D8218996 Pulled By: ajkr fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb	2018-06-05 14:14:05 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Zhongyi Xie	50d7ac0ea3	Fix test for rocksdb_lite: hide incompatible option kDirectIO Summary: Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build. Closes https://github.com/facebook/rocksdb/pull/3943 Differential Revision: D8242757 Pulled By: miasantreble fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a	2018-06-01 20:42:36 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
straw	89b37081a1	add c api rocksdb_sstfilewriter_file_size Summary: Closes https://github.com/facebook/rocksdb/pull/3922 Differential Revision: D8208528 Pulled By: ajkr fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224	2018-06-01 09:43:59 -07:00
Maysam Yabandeh	44cf84932f	Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db	2018-05-31 19:28:00 -07:00
QingpingWang	2807678b11	c api set bottommost level compaction Summary: Closes https://github.com/facebook/rocksdb/pull/3928 Differential Revision: D8224962 Pulled By: ajkr fbshipit-source-id: 3caf463509a935bff46530f27232a85ae7e4e484	2018-05-31 17:30:50 -07:00
Siying Dong	82089d59c3	DBImpl::FindObsoleteFiles() not to call GetChildren() on the same path Summary: DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it. Closes https://github.com/facebook/rocksdb/pull/3885 Differential Revision: D8084634 Pulled By: siying fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0	2018-05-31 12:58:33 -07:00
maoyouxiang	a35451eaa4	fix deadlock with enable_pipelined_write=true and max_successive_merges > 0 Summary: fix this https://github.com/facebook/rocksdb/issues/3916 Closes https://github.com/facebook/rocksdb/pull/3923 Differential Revision: D8215192 Pulled By: yiwu-arbug fbshipit-source-id: a4c2f839a91d92dc70906d2b7c6de0fe014a2422	2018-05-31 11:13:14 -07:00
Siying Dong	4dd80debd0	Remove tests from ROCKSDB_VALGRIND_RUN Summary: In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind. Closes https://github.com/facebook/rocksdb/pull/3924 Differential Revision: D8210184 Pulled By: siying fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a	2018-05-30 16:15:16 -07:00
Anand Ananthabhotla	a736255de8	Delete triggered compaction for universal style Summary: This is still WIP, but I'm hoping for early feedback on the overall approach. This patch implements deletion triggered compaction, which till now only worked for leveled, for universal style. SST files are marked for compaction by the CompactOnDeletionCollertor table property. This is expected to be used when free disk space is low and the user wants to reclaim space by deleting a bunch of keys. The deletions are expected to be dense. In such a situation, we want to avoid a full compaction due to its space overhead. The strategy used in this case is similar to leveled. We pick one file from the set of files marked for compaction. We then expand the inputs to a clean cut on the same level, and then pick overlapping files from the next non-mepty level. Picking files from the next level can cause the key range to expand, and we opportunistically expand inputs in the source level to include files wholly in this key range. The main side effect of this is that it breaks the property of no time range overlap between levels. This shouldn't break any functionality. Closes https://github.com/facebook/rocksdb/pull/3860 Differential Revision: D8124397 Pulled By: anand1976 fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf	2018-05-29 15:44:34 -07:00
Yanqin Jin	cf826de3ed	Fix compilation error when OPT="-DROCKSDB_LITE". Summary: Closes https://github.com/facebook/rocksdb/pull/3917 Differential Revision: D8187733 Pulled By: riversand963 fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910	2018-05-29 12:28:59 -07:00
奏之章	1c1bafa668	Fix VersionStorageInfo::EstimateLiveDataSize seg fault Summary: `HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true `402b7aa07f/db/internal_stats.cc (L412-L413)` so , is will ref a `SuperVersion` `402b7aa07f/db/db_impl.cc (L1896-L1908)` so , the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe , but `cfd_->current()` is not safe ! `402b7aa07f/db/internal_stats.cc (L790-L795)` the `cfd_->current()` maybe invalid ... here's mongo-rocks crash backtrace ``` mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51] mongod(+0x2152E89) [0x7fe3a3136e89] mongod(+0x21534F6) [0x7fe3a31374f6] libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0] mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7] mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba] mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long, rocksdb::DBImpl, rocksdb::Version)+0x20) [0x7fe3a2250d70] mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf] ``` Closes https://github.com/facebook/rocksdb/pull/3912 Differential Revision: D8179944 Pulled By: yiwu-arbug fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1	2018-05-28 11:27:08 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Yanqin Jin	aa53579d6c	Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438	2018-05-25 11:57:51 -07:00
QingpingWang	070319f7bb	add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup Summary: Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior. Closes https://github.com/facebook/rocksdb/pull/3897 Differential Revision: D8157676 Pulled By: ajkr fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505	2018-05-24 22:28:52 -07:00
Yi Wu	bc7e8d472e	LRUCache midpoint insertion Summary: Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache. Closes https://github.com/facebook/rocksdb/pull/3877 Differential Revision: D8100895 Pulled By: yiwu-arbug fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da	2018-05-24 15:57:33 -07:00
Yanqin Jin	4011012d9d	Specify the underlying type of enums. Summary: Explicitly specify the underlying type of enums help developers understand the physical storage. Closes https://github.com/facebook/rocksdb/pull/3892 Differential Revision: D8107027 Pulled By: riversand963 fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3	2018-05-23 16:12:59 -07:00
Andrew Kryczka	7db721b9a6	Avoid sleep in DBTest.GroupCommitTest to fix flakiness Summary: DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads. Closes https://github.com/facebook/rocksdb/pull/3883 Differential Revision: D8079429 Pulled By: ajkr fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049	2018-05-22 12:16:25 -07:00
Siying Dong	3db1ada3bf	PersistRocksDBOptions() to use WritableFileWriter Summary: By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering. Closes https://github.com/facebook/rocksdb/pull/3882 Differential Revision: D8080673 Pulled By: siying fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b	2018-05-21 16:42:22 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Yanqin Jin	263ef52b65	Update ColumnFamilyTest for multi-CF verification Summary: Change `keys_` from `set<string>` to `vector<set<string>>` so that each column family's keys are stored in one set. ajkr When you have a chance, can you PTAL? Thanks! Closes https://github.com/facebook/rocksdb/pull/3871 Differential Revision: D8056447 Pulled By: riversand963 fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386	2018-05-21 11:57:42 -07:00
Andrew Kryczka	7b655214d2	Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41	2018-05-21 09:57:00 -07:00
Zhongyi Xie	ed4d3393fb	fix a division by zero bug Summary: fixes the failing clang_analyze contrun test Closes https://github.com/facebook/rocksdb/pull/3872 Differential Revision: D8059241 Pulled By: miasantreble fbshipit-source-id: e8fc1838004fe16a823456188386b8b39429803b	2018-05-18 21:57:24 -07:00
Siying Dong	17af09fcce	Implement key shortening functions in ReverseBytewiseComparator Summary: Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second. Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty. Closes https://github.com/facebook/rocksdb/pull/3836 Differential Revision: D7959762 Pulled By: siying fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683	2018-05-17 18:27:16 -07:00
Zhongyi Xie	1d7ca20f29	add override to virtual functions Summary: this will fix the failing clang_check test Closes https://github.com/facebook/rocksdb/pull/3868 Differential Revision: D8050880 Pulled By: miasantreble fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6	2018-05-17 17:57:48 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Maysam Yabandeh	46fde6b653	Fix race condition between log_.erase and log_.back Summary: log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it. Fixes https://github.com/facebook/rocksdb/issues/3852 Closes https://github.com/facebook/rocksdb/pull/3859 Differential Revision: D8026103 Pulled By: maysamyabandeh fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28	2018-05-16 13:01:33 -07:00
Maysam Yabandeh	12ad711247	Suppress tsan lock-order-inversion on FlushWAL Summary: TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed. The alternative would be to use ``` TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test ``` but it does not seem straightforward to integrate it to our test infra. Closes https://github.com/facebook/rocksdb/pull/3854 Differential Revision: D8000202 Pulled By: maysamyabandeh fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802	2018-05-14 21:13:35 -07:00
Andrew Kryczka	3d7dc75b36	Bottommost level-based compactions in bottom-pri pool Summary: This feature was introduced for universal compaction in `cc01985d`. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level. Closes https://github.com/facebook/rocksdb/pull/3835 Differential Revision: D7957179 Pulled By: ajkr fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4	2018-05-14 14:57:15 -07:00
Maysam Yabandeh	718c1c9c1f	Pass manual_wal_flush also to the first wal file Summary: Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests. This PR is built on top of https://github.com/facebook/rocksdb/pull/3756 Closes https://github.com/facebook/rocksdb/pull/3824 Differential Revision: D7909153 Pulled By: maysamyabandeh fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce	2018-05-14 10:57:56 -07:00
Sergey Elin	3272bc07c6	Fix formatting in log message Summary: Add missing space. Closes https://github.com/facebook/rocksdb/pull/3826 Differential Revision: D7956059 Pulled By: miasantreble fbshipit-source-id: 3aeba76385f8726399a3086c46de710636a31191	2018-05-11 11:28:54 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
Dmitri Smirnov	f92cd2feb4	Introduce and use the option to disable stall notifications structures Summary: and code. Removing this helps with insert performance. Closes https://github.com/facebook/rocksdb/pull/3830 Differential Revision: D7921030 Pulled By: siying fbshipit-source-id: 84e80d50a7ef96f5441c51c9a0d089c50217cce2	2018-05-09 10:13:53 -07:00
Andrew Kryczka	4bf169f07e	Disable readahead when using mmap for reads Summary: `ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR: - prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files - adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer Closes https://github.com/facebook/rocksdb/pull/3813 Differential Revision: D7891513 Pulled By: ajkr fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d	2018-05-08 12:13:18 -07:00
Maysam Yabandeh	d72a51e9e1	Split FaultInjectionTest.FaultTest to avoid timeout Summary: tsan flavor of this test occasionally times out in our test infra. The patch split the test to two, each working on half of the option range. Before: [ OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms) [ OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms) After: [ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms) [ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms) [ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms) [ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms) Closes https://github.com/facebook/rocksdb/pull/3819 Differential Revision: D7894975 Pulled By: maysamyabandeh fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9	2018-05-07 12:29:58 -07:00
LingBin	72942ad7a4	Recommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple times" Summary: The origin commit #3635 will hurt performance for users who aren't using range deletions， because unneeded std::set operations, so it was reverted by commit `44653c7b7a`. (see #3672) To fix this, move the set to and add a check in , i.e., file will be added only if is non-nullptr. The db_bench command which find the performance regression: > ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 > --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 > --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 > -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none Before and after the modification, I re-run this command on the machine, the results of are as follows： fillrandom Table \| P50 \| P75 \| P99 \| P99.9 \| P99.99 \| ---- \| --- \| --- \| --- \| ----- \| ------ \| before commit \| 5.92 \| 8.57 \| 19.63 \| 980.97 \| 12196.00 \| after commit \| 5.91 \| 8.55 \| 19.34 \| 965.56 \| 13513.56 \| seekrandomwhilewriting Table \| P50 \| P75 \| P99 \| P99.9 \| P99.99 \| ---- \| --- \| --- \| --- \| ----- \| ------ \| before commit \| 1418.62 \| 1867.01 \| 3823.28 \| 4980.99 \| 9240.00 \| after commit \| 1450.54 \| 1880.61 \| 3962.87 \| 5429.60 \| 7542.86 \| Closes https://github.com/facebook/rocksdb/pull/3800 Differential Revision: D7874245 Pulled By: ajkr fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17	2018-05-04 16:45:15 -07:00
Maysam Yabandeh	171f415b30	Rename vars to satisfy unity built Summary: Tested by "make unity_test" Closes https://github.com/facebook/rocksdb/pull/3807 Differential Revision: D7882657 Pulled By: maysamyabandeh fbshipit-source-id: 84862c18d7f2fc762bd96ad070eaeb6936e45159	2018-05-04 15:28:06 -07:00
Zhongyi Xie	a703432808	MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction Summary: `MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used. This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`) Fixes https://github.com/facebook/rocksdb/issues/3229 In master: ``` Level Files Size(MB) -------------------- 0 14 846 1 0 0 2 0 0 3 0 0 4 0 0 5 15 366 6 11 481 Cumulative compaction: 3.83 GB write, 2.27 GB read ``` In branch: ``` Level Files Size(MB) -------------------- 0 9 544 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 445 935 Cumulative compaction: 2.91 GB write, 1.46 GB read ``` db_bench command used: ``` ./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2 ``` Closes https://github.com/facebook/rocksdb/pull/3755 Differential Revision: D7721381 Pulled By: miasantreble fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8	2018-05-03 16:42:13 -07:00
Dmitri Smirnov	934f96de27	Better destroydb Summary: Delete archive directory before WAL folder since archive may be contained as a subfolder. Also improve loop readability. Closes https://github.com/facebook/rocksdb/pull/3797 Differential Revision: D7866378 Pulled By: riversand963 fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5	2018-05-03 16:13:09 -07:00
Maysam Yabandeh	a8d77ca381	Speedup ManualCompactionTest.Test Summary: ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test. Closes https://github.com/facebook/rocksdb/pull/3802 Differential Revision: D7865596 Pulled By: maysamyabandeh fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785	2018-05-03 16:13:09 -07:00
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	2018-05-03 15:43:09 -07:00
Zhongyi Xie	6cab3184f5	avoid double delete on dummy record insertion failure Summary: When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless. Closes https://github.com/facebook/rocksdb/pull/3688 Differential Revision: D7537741 Pulled By: miasantreble fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6	2018-05-01 16:01:28 -07:00
Victor Grishchenko	c9ace1d81b	expose WAL iterator in the C API Summary: A minor change: I wrapped TransactionLogIterator for the C API. I needed that for the golang binding. Closes https://github.com/facebook/rocksdb/pull/3304 Differential Revision: D6628736 Pulled By: miasantreble fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a	2018-04-27 16:56:59 -07:00
Huachao Huang	ed7a95b28c	Add max_subcompactions as a compaction option Summary: Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default. I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically. Closes https://github.com/facebook/rocksdb/pull/3775 Differential Revision: D7792357 Pulled By: ajkr fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c	2018-04-27 11:57:39 -07:00
Yanqin Jin	7dfbe33532	Rename pending_compaction_ to queued_for_compaction_. Summary: We use `queued_for_flush_` to indicate a column family has been added to the flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used `pending_compaction_` which can also be ambiguous. Closes https://github.com/facebook/rocksdb/pull/3781 Differential Revision: D7790063 Pulled By: riversand963 fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804	2018-04-27 11:12:01 -07:00
Yanqin Jin	513b5ce618	Rename pending_flush_ to queued_for_flush_. Summary: With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush ``` if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) { ... } ``` `Pending` is ambiguous, and I feel `queued_for_flush` is a better name, especially for the sake of readability. Closes https://github.com/facebook/rocksdb/pull/3777 Differential Revision: D7783066 Pulled By: riversand963 fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229	2018-04-26 21:12:51 -07:00
Siying Dong	63c965cdb4	Sync parent directory after deleting a file in delete scheduler Summary: sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want. Closes https://github.com/facebook/rocksdb/pull/3767 Differential Revision: D7760136 Pulled By: siying fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f	2018-04-26 13:58:20 -07:00
Vincent Lee	7c9f23e6db	Rate limiter should be allowed to share between different rocksdb instances in C API Summary: Currently, the `rocksdb_options_set_ratelimiter` in `c.cc` will change the input to nil, which make it is not possible to use the shared rate limiter create by `rocksdb_ratelimiter_create` in different rocksdb option. In this pr, I changed it to shared ptr. Closes https://github.com/facebook/rocksdb/pull/3758 Differential Revision: D7749740 Pulled By: ajkr fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5	2018-04-25 15:57:48 -07:00
Mike Kolupaev	affe01b0d5	Improve write time breakdown stats Summary: There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes. These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing. Closes https://github.com/facebook/rocksdb/pull/3602 Differential Revision: D7251562 Pulled By: al13n321 fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d	2018-04-23 17:58:54 -07:00
Siying Dong	d5afa73789	Revert "Skip deleted WALs during recovery" Summary: This reverts commit `73f21a7b21`. It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error: "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory" This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file. Closes https://github.com/facebook/rocksdb/pull/3762 Differential Revision: D7730035 Pulled By: siying fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77	2018-04-23 12:01:26 -07:00
Anand Ananthabhotla	dbdaa4662e	Add a stat for MultiGet keys found, update memtable hit/miss stats Summary: 1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the number of keys successfully read 2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in DBImpl::GetImpl(), but not MultiGet Closes https://github.com/facebook/rocksdb/pull/3730 Differential Revision: D7677364 Pulled By: anand1976 fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5	2018-04-20 15:28:19 -07:00
Maysam Yabandeh	c3d1e36cce	WritePrepared Txn: enable TryAgain for duplicates at the end of the batch Summary: The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch. Closes https://github.com/facebook/rocksdb/pull/3747 Differential Revision: D7708391 Pulled By: maysamyabandeh fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d	2018-04-20 15:28:19 -07:00
przemyslaw.skibinski@percona.com	dee95a1afc	Fix GitHub issue #3716 : gcc-8 warnings Summary: Fix the following gcc-8 warnings: - conflicting C language linkage declaration [-Werror] - writing to an object with no trivial copy-assignment [-Werror=class-memaccess] - array subscript -1 is below array bounds [-Werror=array-bounds] Solves https://github.com/facebook/rocksdb/issues/3716 Closes https://github.com/facebook/rocksdb/pull/3736 Differential Revision: D7684161 Pulled By: yiwu-arbug fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a	2018-04-20 13:42:47 -07:00
Zhongyi Xie	e1e826b980	check return status for Sync() and Append() calls to avoid corruption Summary: Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status. Closes https://github.com/facebook/rocksdb/pull/3740 Differential Revision: D7678848 Pulled By: miasantreble fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77	2018-04-19 14:13:46 -07:00
Yi Wu	ad511684b2	Add block cache related DB properties Summary: Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage. Closes https://github.com/facebook/rocksdb/pull/3734 Differential Revision: D7657180 Pulled By: yiwu-arbug fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd	2018-04-18 21:42:25 -07:00
Yanqin Jin	5e48811844	Initialize a boolean member variable of a struct. Summary: The reason for this initialization is that LLVM UBSAN check will fail due to uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool). UBSAN log: > ===== Running external_sst_file_basic_test [==========] Running 7 tests from 1 test case. [----------] Global test environment set-up. [----------] 7 tests from ExternalSSTFileBasicTest [ RUN ] ExternalSSTFileBasicTest.Basic [ OK ] ExternalSSTFileBasicTest.Basic (6 ms) [ RUN ] ExternalSSTFileBasicTest.NoCopy db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool' miasantreble I've tested this locally using the following command. ``` TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check ``` ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables. Closes https://github.com/facebook/rocksdb/pull/3728 Differential Revision: D7642476 Pulled By: riversand963 fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e	2018-04-16 14:28:01 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
zhangjinpeng1987	31ee4bf240	add kEntryRangeDeletion Summary: When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read. After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction. Closes https://github.com/facebook/rocksdb/pull/3695 Differential Revision: D7570322 Pulled By: ajkr fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369	2018-04-13 11:27:17 -07:00
Yanqin Jin	c81b0abedd	Improve accuracy of I/O stats collection of external SST ingestion. Summary: RocksDB supports ingestion of external ssts. If ingestion_options.move_files is true, when performing ingestion, RocksDB first tries to link external ssts. If external SST file resides on a different FS, or the underlying FS does not support hard link, then RocksDB performs actual file copy. However, no matter which choice is made, current code increase bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy file. Rename a sync point. Closes https://github.com/facebook/rocksdb/pull/3713 Differential Revision: D7604151 Pulled By: riversand963 fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586	2018-04-13 10:58:42 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Yanqin Jin	d42bd041c5	Improve visibility into the reasons for compaction. Summary: Add `compaction_reason` as part of event log for event `compaction started`. Add counters for each `CompactionReason`. Closes https://github.com/facebook/rocksdb/pull/3679 Differential Revision: D7550348 Pulled By: riversand963 fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d	2018-04-11 10:58:44 -07:00
Andrew Kryczka	019d7894eb	fix calling SetOptions on deprecated options Summary: In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion: ``` db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed. ``` We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons. Closes https://github.com/facebook/rocksdb/pull/3700 Differential Revision: D7572596 Pulled By: ajkr fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1	2018-04-10 19:02:09 -07:00
Yanqin Jin	d95014b9df	fix some text in comments. Summary: 1. Remove redundant text. 2. Make terminology consistent across all comments and doc of RocksDB. Also do our best to conform to conventions. Specifically, use 'callback' instead of 'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)). Closes https://github.com/facebook/rocksdb/pull/3693 Differential Revision: D7560396 Pulled By: riversand963 fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8	2018-04-10 15:59:24 -07:00
Zhongyi Xie	2770a94c42	make MockTimeEnv::current_time_ atomic to fix data race Summary: fix a new TSAN failure https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397 Closes https://github.com/facebook/rocksdb/pull/3694 Differential Revision: D7565310 Pulled By: miasantreble fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825	2018-04-10 14:13:18 -07:00
Gihwan Oh	65fe8d6cd6	Change a comment Summary: In this case, we add input files of compaction, not outputs. Closes https://github.com/facebook/rocksdb/pull/3686 Differential Revision: D7556781 Pulled By: ajkr fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7	2018-04-09 13:42:31 -07:00
Andrew Kryczka	1c27cbfbd1	fix intra-L0 FIFO for uncompressed use case Summary: - inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before. - fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span. Closes https://github.com/facebook/rocksdb/pull/3684 Differential Revision: D7530396 Pulled By: ajkr fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e	2018-04-09 13:42:31 -07:00
Zhongyi Xie	f3a1d9e049	fix data race Summary: Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`: https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899 Closes https://github.com/facebook/rocksdb/pull/3691 Differential Revision: D7541400 Pulled By: miasantreble fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb	2018-04-09 12:28:28 -07:00
Maysam Yabandeh	bde1c1a72a	WritePrepared Txn: add stats Summary: Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex. Closes https://github.com/facebook/rocksdb/pull/3683 Differential Revision: D7529393 Pulled By: maysamyabandeh fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9	2018-04-07 21:56:42 -07:00
Gihwan Oh	74767deec3	Fix typo Summary: regrad -> regard Closes https://github.com/facebook/rocksdb/pull/3685 Differential Revision: D7540952 Pulled By: miasantreble fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad	2018-04-06 15:42:50 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Dmitri Smirnov	147dfc7bdf	Fix pre_release callback argument list. Summary: Primitive types constness does not affect the signature of the method and has no influence on whether the overriding method would actually have that const bool instead of just bool. In addition, it is rarely useful but does produce a compatibility warnings in VS 2015 compiler. Closes https://github.com/facebook/rocksdb/pull/3663 Differential Revision: D7475739 Pulled By: ajkr fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f	2018-04-05 11:12:16 -07:00
Zhongyi Xie	c827b2dc2a	fix build for rocksdb lite Summary: currently rocksdb lite build fails due to the following errors: > db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type virtual void OnFlushCompleted(DB* /db/, const FlushJobInfo& info) override { ^ db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB, const int&)’ marked ‘override’, but does not override virtual void OnFlushCompleted(DB /db/, const FlushJobInfo& info) override { ^ db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor] class FlushedFileCollector : public EventListener { ^ db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB, const int&)’: db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’ flushed_files_.push_back(info.file_path); ^ cc1plus: all warnings being treated as errors make: ** [db/db_sst_test.o] Error 1 Closes https://github.com/facebook/rocksdb/pull/3676 Differential Revision: D7493006 Pulled By: miasantreble fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f	2018-04-05 09:11:36 -07:00
Sagar Vemuri	7d9067991e	Ttl-triggered and snapshot-release-triggered compactions should not be manual compactions Summary: Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug. Closes https://github.com/facebook/rocksdb/pull/3678 Differential Revision: D7498151 Pulled By: sagar0 fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d	2018-04-05 06:41:52 -07:00
Sagar Vemuri	04c11b867d	Level Compaction with TTL Summary: Level Compaction with TTL. As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space. Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data. This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space. This functionality can be controlled by the newly introduced column family option -- ttl. TODO for later: - Make ttl mutable - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO) - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option. Closes https://github.com/facebook/rocksdb/pull/3591 Differential Revision: D7275442 Pulled By: sagar0 fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267	2018-04-02 22:14:28 -07:00
Maysam Yabandeh	b225de7e10	WritePrepared Txn: smallest_prepare optimization Summary: The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed. To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step. Closes https://github.com/facebook/rocksdb/pull/3649 Differential Revision: D7388630 Pulled By: maysamyabandeh fbshipit-source-id: b79506238c17467d590763582960d4d90181c600	2018-04-02 20:27:41 -07:00
Amy Tai	1579626d0d	Enable cancelling manual compactions if they hit the sfm size limit Summary: Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true. Closes https://github.com/facebook/rocksdb/pull/3670 Differential Revision: D7457683 Pulled By: amytai fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5	2018-04-02 19:58:04 -07:00
Zhongyi Xie	44653c7b7a	Revert "Avoid adding tombstones of the same file to RangeDelAggregato… Summary: …r multiple times" This reverts commit `e80709a33a`. lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads I'm reverting the commit for now but feel free to submit new patches 😃 To reproduce the regression, you can run the following db_bench command > ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none write stats printed by db_bench: Table \| \| \| \| \| \| \| \| \| \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- revert commit \| Percentiles: \| P50: \| 80.77 \| P75: \|102.94 \|P99: \| 1786.44 \| P99.9: \| 1892.39 \|P99.99: 2645.10 \| keep commit \| Percentiles: \| P50: \| 221.72 \| P75: \| 686.62 \| P99: \| 1842.57 \| P99.9: \| 1899.70\| P99.99: 2814.29\| Closes https://github.com/facebook/rocksdb/pull/3672 Differential Revision: D7463315 Pulled By: miasantreble fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d	2018-04-02 19:58:04 -07:00
Fosco Marotto	d12112d05e	Throw NoSpace instead of IOError when out of space. Summary: Replaces #1702 and is updated from feedback. Closes https://github.com/facebook/rocksdb/pull/3531 Differential Revision: D7457395 Pulled By: gfosco fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82	2018-03-30 15:27:18 -07:00
Maysam Yabandeh	73f21a7b21	Skip deleted WALs during recovery Summary: This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Closes https://github.com/facebook/rocksdb/pull/3488 Differential Revision: D6967893 Pulled By: maysamyabandeh fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1	2018-03-30 11:28:05 -07:00
Maysam Yabandeh	89d989ed75	WritePrepared Txn: fix a bug in publishing recoverable state seq Summary: When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq. Closes https://github.com/facebook/rocksdb/pull/3665 Differential Revision: D7446099 Pulled By: maysamyabandeh fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b	2018-03-29 14:46:41 -07:00
Maysam Yabandeh	0377ff9dea	WritePrepared Txn: make recoverable state visible after flush Summary: Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state. Closes https://github.com/facebook/rocksdb/pull/3661 Differential Revision: D7424577 Pulled By: maysamyabandeh fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c	2018-03-28 12:12:08 -07:00
Yanqin Jin	1f5def1653	Fix race condition causing double deletion of ssts Summary: Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst. Also add a unit test to verify that this PR fixes the issue. The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows ``` timeline user_thread background_compaction thread t1 \| FindObsoleteFiles(full_scan=false) t2 \| FindObsoleteFiles(full_scan=true) t3 \| PurgeObsoleteFiles t4 \| PurgeObsoleteFiles V ``` When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`. To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files. ajkr could you take a look and comment? Thanks! Closes https://github.com/facebook/rocksdb/pull/3638 Differential Revision: D7384372 Pulled By: riversand963 fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3	2018-03-28 10:29:59 -07:00
Maysam Yabandeh	35a4469bbf	Fix race condition via concurrent FlushWAL Summary: Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not. Fixes #3599 Closes https://github.com/facebook/rocksdb/pull/3656 Differential Revision: D7405608 Pulled By: maysamyabandeh fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934	2018-03-26 16:29:56 -07:00
Maysam Yabandeh	3e417a6607	WritePrepared Txn: AddPrepared for all sub-batches Summary: Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches. Closes https://github.com/facebook/rocksdb/pull/3651 Differential Revision: D7388635 Pulled By: maysamyabandeh fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4	2018-03-23 17:30:04 -07:00
Dmitri Smirnov	d382ae7de6	Imporve perf of random read and insert compare by suggesting inlining to the compiler Summary: Results from 2015 compiler. This improve sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining. Before: fillseq : 3.638 micros/op 274866 ops/sec; 213.9 MB/s After: fillseq : 3.379 micros/op 295979 ops/sec; 230.3 MB/s Closes https://github.com/facebook/rocksdb/pull/3645 Differential Revision: D7382711 Pulled By: siying fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4	2018-03-23 13:26:55 -07:00
LingBin	e80709a33a	Avoid adding tombstones of the same file to RangeDelAggregator multiple times Summary: RangeDelAggregator will remember the files whose range tombstones have been added, so the caller can check whether the file has been added before call AddTombstones. Closes https://github.com/facebook/rocksdb/pull/3635 Differential Revision: D7354604 Pulled By: ajkr fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107	2018-03-23 12:43:06 -07:00
Radoslaw Zarzynski	09b6bf828a	InlineSkiplist: don't decode keys unnecessarily during comparisons Summary: Summary ======== `InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that. Rough performance comparison ===== Big keys, no compression. ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s ``` ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s ``` TODO ====== In ~~a separated~~ this PR: - [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely. - [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key. Closes https://github.com/facebook/rocksdb/pull/3516 Differential Revision: D7059300 Pulled By: ajkr fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb	2018-03-23 12:14:30 -07:00
Zhongyi Xie	1cbc96d236	FlushReason improvement Summary: Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full" > $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 $ grep 'flush_reason' /dev/shm/dbbench/LOG ... 2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} With the fix: > 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"} Closes https://github.com/facebook/rocksdb/pull/3627 Differential Revision: D7328772 Pulled By: miasantreble fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8	2018-03-22 18:42:18 -07:00
Sagar Vemuri	2e3d407778	Fsync after writing global seq number in ExternalSstFileIngestionJob Summary: Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect. Closes https://github.com/facebook/rocksdb/pull/3644 Differential Revision: D7373813 Pulled By: sagar0 fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8	2018-03-22 17:42:56 -07:00
Andrew Kryczka	4d51feab0b	Rename function for handling WAL write error Summary: It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback. Closes https://github.com/facebook/rocksdb/pull/3485 Differential Revision: D6955787 Pulled By: ajkr fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10	2018-03-22 15:58:39 -07:00
Maysam Yabandeh	7429b20e39	WritePrepared Txn: fix race condition on publishing seq Summary: This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue. Closes https://github.com/facebook/rocksdb/pull/3641 Differential Revision: D7361508 Pulled By: maysamyabandeh fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7	2018-03-22 14:43:36 -07:00
QingpingWang	70282cf876	fix behavior does not match name for "IsFileDeletionsEnabled" Summary: for PR https://github.com/facebook/rocksdb/pull/3598 I deleted the original repo for some reason. Sorry for the inconvenience. Closes https://github.com/facebook/rocksdb/pull/3612 Differential Revision: D7291671 Pulled By: ajkr fbshipit-source-id: 918490ba86b13fe450d232af436cbe259d847c64	2018-03-21 22:13:34 -07:00
QingpingWang	2ce8f63f81	C API for PerfContext Summary: This pull request exposes the interface of PerfContext as C API Closes https://github.com/facebook/rocksdb/pull/3607 Differential Revision: D7294225 Pulled By: ajkr fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c	2018-03-21 22:13:34 -07:00
Jingguo Yao	8823487ff7	doc: fix a typo Summary: s/synchromization/synchronization/ Closes https://github.com/facebook/rocksdb/pull/3583 Differential Revision: D7276596 Pulled By: sagar0 fbshipit-source-id: 552ec6d6935f642e1a3a7c552de6c94441ac50e0	2018-03-21 15:58:58 -07:00
Siying Dong	93d52696bf	Memory Problem Of Destorying ColumnFamilyHandle after deleting the CF Summary: When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step. Closes https://github.com/facebook/rocksdb/pull/3610 Differential Revision: D7281025 Pulled By: siying fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc	2018-03-20 17:13:12 -07:00
Andrew Kryczka	d1b26507bd	fix db_compaction_test when compression disabled Summary: Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in no-compression use case, andone file in compression use case. I increased `target_file_size_base` so it generates one file in both use cases. Closes https://github.com/facebook/rocksdb/pull/3625 Differential Revision: D7311885 Pulled By: ajkr fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107	2018-03-19 12:30:05 -07:00
zhsj	cc340268e9	fix wrong length in snprintf Summary: Closes https://github.com/facebook/rocksdb/pull/3622 Differential Revision: D7307689 Pulled By: ajkr fbshipit-source-id: b8f52effc63fea06c2058b39c60944c2c1f814b4	2018-03-16 13:27:55 -07:00
Huachao Huang	ecfca1ff59	Optimize overlap checking for external file ingestion Summary: If there are a lot of overlapped files in L0, creating a merging iterator for all files in L0 to check overlap can be very slow because we need to read and seek all files in L0. However, in that case, the ingested file is likely to overlap with some files in L0, so if we check those files one by one, we can stop once we encounter overlap. Ref: https://github.com/facebook/rocksdb/issues/3540 Closes https://github.com/facebook/rocksdb/pull/3564 Differential Revision: D7196784 Pulled By: anand1976 fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67	2018-03-16 10:43:17 -07:00
Niv Dayan	da82aab126	allowing CompactFiles to return new file names Summary: This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction. Closes https://github.com/facebook/rocksdb/pull/3608 Differential Revision: D7275789 Pulled By: siying fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba	2018-03-15 11:58:12 -07:00
Dmitri Smirnov	6f7b7f91b5	Optionally create DuplicateDetector Summary: Address issue https://github.com/facebook/rocksdb/issues/3579 Closes https://github.com/facebook/rocksdb/pull/3589 Differential Revision: D7221161 Pulled By: yiwu-arbug fbshipit-source-id: bd875ab0aa0e414dfa98b1bf036ba9b4ed351361	2018-03-14 00:57:25 -07:00
Andrew Kryczka	2256dab135	fix flaky DBSSTTest.DeleteSchedulerMultipleDBPaths Summary: I landed #3544 which made this test flaky. The reason was the files scheduled for deletion sometimes went through the trash-marking process, and sometimes were deleted directly. Our counter only bumped on the former code path, so if the latter code path was used, we'd miss counting a file deleted by deletion scheduler. This PR also bumps the counter in the latter code path. Closes https://github.com/facebook/rocksdb/pull/3593 Differential Revision: D7226173 Pulled By: yiwu-arbug fbshipit-source-id: 81ab44c60834df6ff88db1d73ea34e26c6e93c39	2018-03-13 14:57:26 -07:00
Amy Tai	e476d0e252	Adding stat to count cancelled compactions Summary: Added a stat that counts the number of cancelled compactions. Closes https://github.com/facebook/rocksdb/pull/3574 Differential Revision: D7190259 Pulled By: amytai fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3	2018-03-08 10:42:28 -08:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Lukas Rist	a277b0f2b7	Clarification regarding record format Summary: The CRC is actually calculated based on the record type and payload. The wiki should also be updated accordingly and extended with a section on the recyclable record format. Closes https://github.com/facebook/rocksdb/pull/3576 Differential Revision: D7196478 Pulled By: siying fbshipit-source-id: 39f7a0395075cc73e2aa2bfc9e42c85bce35e765	2018-03-08 10:27:25 -08:00
Bruce Mitchener	0de710f5b8	Use nullptr instead of NULL / 0 more consistently. Summary: Closes https://github.com/facebook/rocksdb/pull/3569 Differential Revision: D7170968 Pulled By: yiwu-arbug fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348	2018-03-07 12:42:12 -08:00
Stuart	f021f1d9e1	Add rocksdb_open_with_ttl function in C API Summary: Change-Id: Ie6f9b10bce459f6bf0ade0e5877264b4e10da3f5 Signed-off-by: Stuart <Stuart.Hu@emc.com> Closes https://github.com/facebook/rocksdb/pull/3553 Differential Revision: D7144833 Pulled By: sagar0 fbshipit-source-id: 815225fa6e560d8a5bc47ffd0a98118b107ce264	2018-03-06 20:57:20 -08:00
amytai	0a3db28d98	Disallow compactions if there isn't enough free space Summary: This diff handles cases where compaction causes an ENOSPC error. This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize. It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC. Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions. Closes https://github.com/facebook/rocksdb/pull/3449 Differential Revision: D7016941 Pulled By: amytai fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445	2018-03-06 16:27:54 -08:00
Andrew Kryczka	20c508c1ed	Enable subcompactions in manual level-based compaction Summary: This is the simplest way I could think of to speed up `CompactRange`. It works but isn't that optimal because it relies on the same `max_compaction_bytes` and `max_subcompactions` options that are used in other places. If it turns out to be useful we can allow overriding these in `CompactRangeOptions` in the future. Closes https://github.com/facebook/rocksdb/pull/3549 Differential Revision: D7117634 Pulled By: ajkr fbshipit-source-id: d0cd03d6bd0d2fd7ea3fb13cd3b8bf7c47d11e42	2018-03-06 12:43:51 -08:00
Andrew Kryczka	6a3eebbab0	support multiple db_paths in SstFileManager Summary: Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion. Closes https://github.com/facebook/rocksdb/pull/3544 Differential Revision: D7093786 Pulled By: ajkr fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f	2018-03-06 12:43:51 -08:00
Fosco Marotto	d518fe1da6	uint64_t and size_t changes to compile for iOS Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e	2018-03-06 12:43:51 -08:00
Dmitri Smirnov	c364eb42b5	Windows cumulative patch Summary: This patch addressed several issues. Portability including db_test std::thread -> port::Thread Cc: @ and %z to ROCKSDB portable macro. Cc: maysamyabandeh Implement Env::AreFilesSame Make the implementation of file unique number more robust Get rid of C-runtime and go directly to Windows API when dealing with file primitives. Implement GetSectorSize() and aling unbuffered read on the value if available. Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976 Fix test running script issue where $status var was of incorrect scope so the failures were swallowed and not reported. DestroyDB() creates a logger and opens a LOG file in the directory being cleaned up. This holds a lock on the folder and the cleanup is prevented. This fails one of the checkpoin tests. We observe the same in production. We close the log file in this change. Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test attempts to open a directory with NewRandomAccessFile which does not work on Windows. Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug Closes https://github.com/facebook/rocksdb/pull/3552 Differential Revision: D7156304 Pulled By: siying fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b	2018-03-06 11:57:43 -08:00
Yi Wu	b864bc9b5b	Blob DB: Improve FIFO eviction Summary: Improving blob db FIFO eviction with the following changes, * Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size. * FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion. * FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files. * If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO. * Compaction filter also filter those blob indexes where corresponding blob file is gone. * Add an event listener to listen compaction/flush event and update sst file size. * Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db. * More blob db statistics around FIFO. * Fix some locking issue when accessing a blob file. Closes https://github.com/facebook/rocksdb/pull/3556 Differential Revision: D7139328 Pulled By: yiwu-arbug fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa	2018-03-06 11:57:42 -08:00
Maysam Yabandeh	62277e15c3	WritePrepared Txn: Move DuplicateDetector to util Summary: Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test. Closes https://github.com/facebook/rocksdb/pull/3567 Differential Revision: D7163268 Pulled By: maysamyabandeh fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778	2018-03-05 23:57:12 -08:00
Huachao Huang	9cb4856dbd	Don't need to UpdateFilesByCompactionPri for kCompactionStyleNone Summary: Closes https://github.com/facebook/rocksdb/pull/3563 Differential Revision: D7154653 Pulled By: ajkr fbshipit-source-id: 4f32fb1b02451a934504c40be22b07fb1f2deb9c	2018-03-05 17:57:39 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Maysam Yabandeh	680864ae54	WritePrepared Txn: Fix bug with duplicate keys during recovery Summary: Fix the following bugs: - During recovery a duplicate key was inserted twice into the write batch of the recovery transaction, once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing. - If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number. - SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected. Closes https://github.com/facebook/rocksdb/pull/3562 Differential Revision: D7149440 Pulled By: maysamyabandeh fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77	2018-03-05 10:57:59 -08:00
Sagar Vemuri	15f55e5e06	Fix TSAN timeout in MergeOperatorPinningTest.Randomized/x test Summary: [FB - Internal] MergeOperatorPinningTest.Randomized/x tests are frequently failing with timeouts when run with tsan, as they are exceeding 10 minute limit for tests. The tests are in turn getting disabled due to frequent failures. I halved the number of rounds to make the test complete sooner. This reduces the number of testing iterations a little, but it still is much better than totally letting the test be disabled. Closes https://github.com/facebook/rocksdb/pull/3523 Differential Revision: D7031498 Pulled By: sagar0 fbshipit-source-id: 9a694f2176b235259920a42bf24bca5346f7cff1	2018-03-02 16:27:21 -08:00
Yi Wu	1209b6db5c	Blob DB: remove existing garbage collection implementation Summary: Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that. CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well. Closes https://github.com/facebook/rocksdb/pull/3551 Differential Revision: D7130190 Pulled By: yiwu-arbug fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df	2018-03-02 12:57:23 -08:00
Maysam Yabandeh	d060421c77	Fix a leak in prepared_section_completed_ Summary: The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries. Closes https://github.com/facebook/rocksdb/pull/3545 Differential Revision: D7106071 Pulled By: maysamyabandeh fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48	2018-03-01 20:41:56 -08:00
Yi Wu	bf937cf15b	Add "rocksdb.live-sst-files-size" DB property Summary: Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size. Closes https://github.com/facebook/rocksdb/pull/3548 Differential Revision: D7116939 Pulled By: yiwu-arbug fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a	2018-03-01 18:01:10 -08:00
leviathan1995	ec5843dca9	Comment typo Summary: Closes https://github.com/facebook/rocksdb/pull/3546 Differential Revision: D7111708 Pulled By: ajkr fbshipit-source-id: 522a4a00eb3e34c73afcb86c1f75cd2e90e7608d	2018-02-28 09:56:45 -08:00
Andrew Kryczka	3ae0047278	skip CompactRange flush based on memtable contents Summary: CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range. - Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables` - Reuse the above logic in `CompactRange` and skip flushing if no overlap Closes https://github.com/facebook/rocksdb/pull/3520 Differential Revision: D7018897 Pulled By: ajkr fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934	2018-02-27 17:12:44 -08:00
Zhongyi Xie	ad05cbb182	DB:Open should fail on tmpfs when use_direct_reads=true Summary: Before: > $ TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32 DB path: [/dev/shm/dbbench] put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument db_bench: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 \|\| (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed. put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument After: > TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags open error: Not implemented: Direct I/O is not supported by the specified DB. Closes https://github.com/facebook/rocksdb/pull/3539 Differential Revision: D7082658 Pulled By: miasantreble fbshipit-source-id: f9d9c6ec3b5e9e049cab52154940ee101ba4d342	2018-02-26 14:58:06 -08:00
Anand Ananthabhotla	dfbe52e099	Fix the Logger::Close() and DBImpl::Close() design pattern Summary: The recent Logger::Close() and DBImpl::Close() implementation rely on calling the CloseImpl() virtual function from the destructor, which will not work. Refactor the implementation to have a private close helper function in derived classes that can be called by both CloseImpl() and the destructor. Closes https://github.com/facebook/rocksdb/pull/3528 Reviewed By: gfosco Differential Revision: D7049303 Pulled By: anand1976 fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4	2018-02-23 13:57:26 -08:00
Siying Dong	30649dc6a1	Have a different function when ROCKSDB_JEMALLOC=0 Summary: Some sanitizer is not happy with parameter name with ROCKSDB_JEMALLOC not set. Use another function instead. Closes https://github.com/facebook/rocksdb/pull/3536 Differential Revision: D7064849 Pulled By: siying fbshipit-source-id: c6ae94e044686176af1259df9172453d52c2f9d5	2018-02-23 11:42:33 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Sagar Vemuri	8ada876dfe	Add rocksdb.iterator.internal-key property Summary: Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped. Closes https://github.com/facebook/rocksdb/pull/3525 Differential Revision: D7033694 Pulled By: sagar0 fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8	2018-02-20 19:12:09 -08:00
Maysam Yabandeh	c178da053b	WritePrepared Txn: optimizations for sysbench update_noindex Summary: These are optimization that we applied to improve sysbech's update_noindex performance. 1. Make use of LIKELY compiler hint 2. Move std::atomic so the subclass 3. Make use of skip_prepared in non-2pc transactions. Closes https://github.com/facebook/rocksdb/pull/3512 Differential Revision: D7000075 Pulled By: maysamyabandeh fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181	2018-02-16 08:42:31 -08:00
Mike Kolupaev	97307d888f	Fix deadlock in ColumnFamilyData::InstallSuperVersion() Summary: Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_. This deadlock is hit all the time on our workload. It blocks our release. In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so. So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them. This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup. I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine. Closes https://github.com/facebook/rocksdb/pull/3510 Reviewed By: sagar0 Differential Revision: D7005346 Pulled By: al13n321 fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f	2018-02-16 08:13:34 -08:00
Maysam Yabandeh	8eb1d445c3	Unbreak MemTableRep API change Summary: The MemTableRep API was broken by this commit: `813719e952` This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it. Closes https://github.com/facebook/rocksdb/pull/3513 Differential Revision: D7004134 Pulled By: maysamyabandeh fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f	2018-02-15 17:27:24 -08:00
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplifed an assertion - move non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Zhongyi Xie	c88c57cde1	Tweak external file ingestion seqno logic under universal compaction Summary: Right now it is possible that a file gets assigned to L0 but also assigned the seqno from a higher level which it doesn't fit Under the current impl, it is possibe that seqno in lower levels (Ln) can be equal to smallest seqno of higher levels (Ln-1), which is undesirable from universal compaction's point of view. This should fix the intermittent failure of `ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` Closes https://github.com/facebook/rocksdb/pull/3411 Differential Revision: D6813802 Pulled By: miasantreble fbshipit-source-id: 693d0462fa94725ccfb9d8858743e6d2d9992d14	2018-02-15 14:13:39 -08:00
Fosco Marotto	ba6ee1f749	Fix 2 more unused reference errors VS2017 Summary: As in #3425 Closes https://github.com/facebook/rocksdb/pull/3497 Differential Revision: D6979588 Pulled By: gfosco fbshipit-source-id: e9fb32d04ad45575dfe9de1d79348d158e474197	2018-02-14 11:12:36 -08:00
Igor Sugak	d08d05cb62	fix UBSAN errors in fault_injection_test Summary: This fixes shift and signed-integer-overflow UBSAN checks in fault_injection_test by using a larger and unsigned type. Closes https://github.com/facebook/rocksdb/pull/3498 Reviewed By: siying Differential Revision: D6981116 Pulled By: igorsugak fbshipit-source-id: 3688f62cce570534b161e9b5f42109ebc9ae5a2c	2018-02-13 14:12:40 -08:00
Siying Dong	dadf01672a	Rename one of the two LevelIterator Summary: A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes in the same name anyway. Closes https://github.com/facebook/rocksdb/pull/3499 Differential Revision: D6979325 Pulled By: siying fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb	2018-02-13 13:57:58 -08:00
Siying Dong	b555ed30a4	Customized BlockBasedTableIterator and LevelIterator Summary: Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization. The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators. The two-level-iterator now is only used by partitioned index, so it is simplified. Closes https://github.com/facebook/rocksdb/pull/3406 Differential Revision: D6809041 Pulled By: siying fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa	2018-02-12 17:12:25 -08:00
Andrew Kryczka	ee1c802675	Add delay before flush in CompactRange to avoid write stalling Summary: - Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur. - Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true. - Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3). Closes https://github.com/facebook/rocksdb/pull/3381 Differential Revision: D6754983 Pulled By: ajkr fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972	2018-02-12 15:42:47 -08:00
Zhongyi Xie	3f1bb07351	make flush_reason_ atomic to keep TSAN happy Summary: Closes https://github.com/facebook/rocksdb/pull/3487 Differential Revision: D6967098 Pulled By: miasantreble fbshipit-source-id: 48e0accf2e3b3f589ddb797ff8083c8520269bf0	2018-02-12 13:28:18 -08:00
Siying Dong	ef29d2a234	Explictly fail writes if key or value is not smaller than 4GB Summary: Right now, users will encounter unexpected bahavior if they use key or value larger than 4GB. We should explicitly fail the queriers. Closes https://github.com/facebook/rocksdb/pull/3484 Differential Revision: D6953895 Pulled By: siying fbshipit-source-id: b60491e1af064fc5d52971956661f6c18ceac24f	2018-02-09 14:57:54 -08:00
Yi Wu	fe228da0a9	WritePrepared Txn: Support merge operator Summary: CompactionIterator invoke MergeHelper::MergeUntil() to do partial merge between snapshot boundaries. Previously it only depend on sequence number to tell snapshot boundary, but we also need to make use of snapshot_checker to verify visibility of the merge operands to the snapshots. For example, say there is a snapshot with seq = 2 but only can see data with seq <= 1. There are three merges, each with seq = 1, 2, 3. A correct compaction output would be (1),(2+3). Without taking snapshot_checker into account when generating merge result, compaction will generate output (1+2),(3). By filtering uncommitted keys with read callback, the read path already take care of merges well and don't need additional updates. Closes https://github.com/facebook/rocksdb/pull/3475 Differential Revision: D6926087 Pulled By: yiwu-arbug fbshipit-source-id: 8f539d6f897cfe29b6dc27a8992f68c2a629d40a	2018-02-09 14:57:54 -08:00
Zhongyi Xie	945f618ba5	log flush reason for better debugging experience Summary: It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc. This PR logs Flush reason whenever a flush is scheduled. Closes https://github.com/facebook/rocksdb/pull/3401 Differential Revision: D6788142 Pulled By: miasantreble fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511	2018-02-09 12:12:43 -08:00
Siying Dong	821e0b1683	Disable options_settable_test in UBSAN and fix UBSAN failure in blob_… Summary: …db_test options_settable_test won't pass UBSAN so disable it. blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though. Closes https://github.com/facebook/rocksdb/pull/3477 Differential Revision: D6928681 Pulled By: siying fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122	2018-02-07 14:42:26 -08:00
Yi Wu	81736d8afe	WritePrepared Txn: update compaction_iterator_test and db_iterator_test Summary: Update compaction_iterator_test with write-prepared transaction DB related tests. Transaction related tests are group in CompactionIteratorWithSnapshotCheckerTest. The existing test are duplicated to make them also test with dummy SnapshotChecker that will say every key is visible to every snapshot (this is okay, we still compare sequence number to verify visibility). Merge related tests are disabled and will be revisit in another PR. Existing db_iterator_tests are also duplicated to test with dummy read_callback that will say every key is committed. Closes https://github.com/facebook/rocksdb/pull/3466 Differential Revision: D6909253 Pulled By: yiwu-arbug fbshipit-source-id: 2ae4656b843a55e2e9ff8beecf21f2832f96cd25	2018-02-06 14:12:13 -08:00
Maysam Yabandeh	88d8b2a2f5	WritePrepared Txn: Duplicate Keys, Txn Part Summary: This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again. Closes https://github.com/facebook/rocksdb/pull/3455 Differential Revision: D6873699 Pulled By: maysamyabandeh fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d	2018-02-05 18:43:24 -08:00
Anand Ananthabhotla	4b124fb9d3	Handle error return from WriteBuffer() Summary: There are a couple of places where we swallow any error from WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate the error up in those cases rather than ignoring it. Closes https://github.com/facebook/rocksdb/pull/3404 Differential Revision: D6879954 Pulled By: anand1976 fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb	2018-02-05 13:59:34 -08:00
Mike Kolupaev	cb5b8f2090	Fix use-after-free in tailing iterator with merge operator Summary: ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws Closes https://github.com/facebook/rocksdb/pull/3415 Differential Revision: D6817414 Pulled By: al13n321 fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24	2018-02-02 21:26:28 -08:00
Tamir Duberstein	cd5092e168	Suppress unused warnings Summary: - Use `__unused__` everywhere - Suppress unused warnings in Release mode + This currently affects non-MSVC builds (e.g. mingw64). Closes https://github.com/facebook/rocksdb/pull/3448 Differential Revision: D6885496 Pulled By: miasantreble fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211	2018-02-02 12:27:07 -08:00
Fosco Marotto	ba8aa8fdc8	Upgrade Appveyor to VS2017 Summary: Per some discussions, this will switch our Appveyor testing to use Visual Studio 2017. Closes https://github.com/facebook/rocksdb/pull/3445 Differential Revision: D6874918 Pulled By: gfosco fbshipit-source-id: c5a0032ca9f37f0d3baeae35c59d850d528c3176	2018-02-01 13:57:01 -08:00
Maysam Yabandeh	813719e952	WritePrepared Txn: Duplicate Keys, Memtable part Summary: Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions. Closes https://github.com/facebook/rocksdb/pull/3418 Differential Revision: D6822412 Pulled By: maysamyabandeh fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77	2018-01-31 18:57:07 -08:00
Fosco Marotto	5400800a56	Work around VS2017 warning for unused reference Summary: For #3407 Closes https://github.com/facebook/rocksdb/pull/3425 Differential Revision: D6836900 Pulled By: gfosco fbshipit-source-id: 7bcaf7a1beeeeabb7c05584f2745e7b4a2473497	2018-01-31 11:58:10 -08:00
Andrew Kryczka	ab5ab36ac2	fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check Summary: Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/". Closes https://github.com/facebook/rocksdb/pull/3440 Differential Revision: D6848213 Pulled By: ajkr fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589	2018-01-30 16:50:42 -08:00
Huachao Huang	ab43ff58b5	Delete files in multiple ranges at once Summary: Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because `VersionSet::LogAndApply` is expensive. This PR adds a new `DeleteFilesInRange` function to delete files in multiple ranges at once. Close https://github.com/facebook/rocksdb/issues/2951 Closes https://github.com/facebook/rocksdb/pull/3431 Differential Revision: D6849228 Pulled By: ajkr fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb	2018-01-30 13:56:39 -08:00
Yi Wu	4bdf06e78f	Fix DBFlushTest::ManualFlushWithMinWriteBufferNumberToMerge dead lock Summary: In the test, there can be a dead lock between background flush thread and foreground main thread as following: * background flush thread: - holding db mutex, while - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point. * foreground thread: - waiting for db mutex to write "key2" Fixing by let background flush thread wait without holding db mutex. Closes https://github.com/facebook/rocksdb/pull/3436 Differential Revision: D6841334 Pulled By: yiwu-arbug fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d	2018-01-29 18:56:47 -08:00
Sagar Vemuri	e6605e5302	Tests for dynamic universal compaction options Summary: Added a test for three dynamic universal compaction options, in the realm of read amplification: - size_ratio - min_merge_width - max_merge_width Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason. Found a bug in compaction reason setting while working on this PR, and fixed in #3412 . TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move. Closes https://github.com/facebook/rocksdb/pull/3419 Differential Revision: D6822217 Pulled By: sagar0 fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4	2018-01-29 16:42:45 -08:00
Zhongyi Xie	3fe0937180	Use block cache to track memory usage when ReadOptions.fill_cache=false Summary: ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache. The memory used by the data block is, however, not trackable by users. To make the system more manageable, we can cost the block to block cache while using it, and then release it after using. Closes https://github.com/facebook/rocksdb/pull/3333 Differential Revision: D6670230 Pulled By: miasantreble fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308	2018-01-29 14:43:10 -08:00
Mark Isaacson	b8eb32f8cf	Suppress lint in old files Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c	2018-01-29 12:56:42 -08:00
Sagar Vemuri	7fcc1d0ddf	Incorrect Universal Compaction reason Summary: While writing tests for dynamic Universal Compaction options, I found that the compaction reasons we set for size-ratio based and sorted-run based universal compactions are swapped with each other. Fixed it. Closes https://github.com/facebook/rocksdb/pull/3412 Differential Revision: D6820540 Pulled By: sagar0 fbshipit-source-id: 270a188968ba25b2c96a8339904416c4c87ff5b3	2018-01-26 11:12:40 -08:00
Yi Wu	c7226428dd	WritePrepared Txn: Fix DBIterator and add test Summary: In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value. Closes https://github.com/facebook/rocksdb/pull/3382 Differential Revision: D6756148 Pulled By: yiwu-arbug fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be	2018-01-23 16:57:11 -08:00
Yi Wu	d46e832e94	Assert last reference before destroy ColumnFamilyData Summary: In ColumnFamilySet destructor, assert it hold the last reference to cfd before destroy them. Closes #3112 Closes https://github.com/facebook/rocksdb/pull/3397 Differential Revision: D6777967 Pulled By: yiwu-arbug fbshipit-source-id: 60b19070e0c194b3b6146699140c1d68777866cb	2018-01-23 15:12:28 -08:00
Yi Wu	edc258127e	DB::DumpSupportInfo should log all supported compression types Summary: DB::DumpSupportInfo should log all supported compression types. Closes #3146 Closes https://github.com/facebook/rocksdb/pull/3396 Differential Revision: D6777019 Pulled By: yiwu-arbug fbshipit-source-id: 5b17f1ffb2d71224e52f7d9c045434746c789fb0	2018-01-23 14:44:12 -08:00
Nathan VanBenschoten	ec0167eecb	Fix WriteBatch rep_ format for RangeDeletion records Summary: This is a small amount of general cleanup I made while experimenting with https://github.com/facebook/rocksdb/issues/3391. Closes https://github.com/facebook/rocksdb/pull/3392 Differential Revision: D6788365 Pulled By: yiwu-arbug fbshipit-source-id: 2716e5aabd5424a4dfdaa954361a62c8eb721ae2	2018-01-23 12:57:32 -08:00
Siying Dong	7291a3f813	Improve fallocate size in compaction output Summary: Now in leveled compaction, we allocate solely based on output target file size. If the total input size is smaller than the number, we should use the total input size instead. Also, cap the allocate size to 1GB. Closes https://github.com/facebook/rocksdb/pull/3385 Differential Revision: D6762363 Pulled By: siying fbshipit-source-id: e30906f6e9bff3ec847d2166e44cb49c92f98a13	2018-01-22 16:43:46 -08:00
Islam AbdelRahman	c615689bb5	Support skipping bloom filters for SstFileWriter Summary: Add an option for SstFileWriter to skip building bloom filters Closes https://github.com/facebook/rocksdb/pull/3360 Differential Revision: D6709120 Pulled By: IslamAbdelRahman fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809	2018-01-22 14:42:18 -08:00
Bernard Spil	6f5ba0bf5b	Fix building on FreeBSD Summary: FreeBSD uses jemalloc as the base malloc implementation. The patch has been functional on FreeBSD as of the MariaDB 10.2 port. Closes https://github.com/facebook/rocksdb/pull/3386 Differential Revision: D6765742 Pulled By: yiwu-arbug fbshipit-source-id: d55dbc082eecf640ef3df9a21f26064ebe6587e8	2018-01-19 17:12:43 -08:00
Yi Wu	5568aec421	Fix DBTest::SoftLimit TSAN failure Summary: Fix data race found by TSAN around WriteStallListener: https://gist.github.com/yiwu-arbug/027d2448b903648f2f0f40b05258d80f Closes https://github.com/facebook/rocksdb/pull/3384 Differential Revision: D6762167 Pulled By: yiwu-arbug fbshipit-source-id: cd3a5c9f806de390bd1af6077ea6dbbc8bcaec09	2018-01-19 12:57:15 -08:00
Yi Wu	f1cb83fcf4	Fix Flush() keep waiting after flush finish Summary: Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence: 1. User call Flush() with flush_options.wait = true 2. The manual flush started in the background 3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached. 4. The manual flush finish. Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed. Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush. Closes https://github.com/facebook/rocksdb/pull/3378 Differential Revision: D6746789 Pulled By: yiwu-arbug fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa	2018-01-18 17:45:16 -08:00
topilski	b9873162f0	Fixed get version on windows, moved throwing exceptions into cc file. Summary: Fixes for msys2 and mingw, hide exceptions into cpp file. Closes https://github.com/facebook/rocksdb/pull/3377 Differential Revision: D6746707 Pulled By: yiwu-arbug fbshipit-source-id: 456b38df80bc48b8386a2cf87f669b5a4f9999a4	2018-01-18 14:56:56 -08:00
Andrew Kryczka	46e599fc6b	fix live WALs purged while file deletions disabled Summary: When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken. We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled. Closes https://github.com/facebook/rocksdb/pull/3341 Differential Revision: D6681131 Pulled By: ajkr fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634	2018-01-17 17:42:04 -08:00
Andrew Kryczka	266d85fbec	fix DBTest.AutomaticConflictsWithManualCompaction Summary: After `af92d4ad11`, only exclusive manual compaction can have conflict. `dc360df81e` updated the conflict-checking test case accordingly. But we missed the point that exclusive manual compaction can only conflict with automatic compactions scheduled after it, since it waits on pending automatic compactions before it begins running. This PR updates the test case to ensure the automatic compactions are scheduled after the manual compaction starts but before it finishes, thus ensuring a conflict. I also cleaned up the test case to use less space as I saw it cause out-of-space error on travis. Closes https://github.com/facebook/rocksdb/pull/3375 Differential Revision: D6735162 Pulled By: ajkr fbshipit-source-id: 020530a4e150a4786792dce7cec5d66b420cb884	2018-01-16 23:12:00 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Sunguck Lee	af92d4ad11	Avoid too frequent MaybeScheduleFlushOrCompaction() call Summary: If there's manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this cause too frequent MaybeScheduleFlushOrCompaction(). https://github.com/facebook/rocksdb/issues/3198 Closes https://github.com/facebook/rocksdb/pull/3366 Differential Revision: D6729575 Pulled By: ajkr fbshipit-source-id: 96da04f8fd33297b1ccaec3badd9090403da29b0	2018-01-16 13:12:12 -08:00
Anand Ananthabhotla	d0f1b49ab6	Add a Close() method to DB to return status when closing a db Summary: Currently, the only way to close an open DB is to destroy the DB object. There is no way for the caller to know the status. In one instance, the destructor encountered an error due to failure to close a log file on HDFS. In order to prevent silent failures, we add DB::Close() that calls CloseImpl() which must be implemented by its descendants. The main failure point in the destructor is closing the log file. This patch also adds a Close() entry point to Logger in order to get status. When DBOptions::info_log is allocated and owned by the DBImpl, it is explicitly closed by DBImpl::CloseImpl(). Closes https://github.com/facebook/rocksdb/pull/3348 Differential Revision: D6698158 Pulled By: anand1976 fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d	2018-01-16 11:08:57 -08:00
Andrew Kryczka	43549c7d59	Prevent unnecessary calls to PurgeObsoleteFiles Summary: Split `JobContext::HaveSomethingToDelete` into two functions: itself and `JobContext::HaveSomethingToClean`. Now we won't call `DBImpl::PurgeObsoleteFiles` in cases where we really just need to call `JobContext::Clean`. The change is needed because I want to track pending calls to `PurgeObsoleteFiles` for a bug fix, which is much simpler if we only call it after `FindObsoleteFiles` finds files to delete. Closes https://github.com/facebook/rocksdb/pull/3350 Differential Revision: D6690609 Pulled By: ajkr fbshipit-source-id: 61502e7469288afe16a663a1b7df345baeaf246f	2018-01-12 13:27:08 -08:00
Andrew Kryczka	ba295cda29	replace DBTest.HugeNumbersOfLevel with a more targeted test case Summary: This test often causes out-of-space error when run on travis. We don't want such stress tests in our unit test suite. The bug in #596, which this test intends to expose, can be repro'd as long as the bottommost level(s) are empty when CompactRange is called. I rewrote the test to cover this simple case without writing a lot of data. Closes https://github.com/facebook/rocksdb/pull/3362 Differential Revision: D6710417 Pulled By: ajkr fbshipit-source-id: 9a1ec85e738c813ac2fee29f1d5302065ecb54c5	2018-01-12 11:12:09 -08:00
Changli Gao	0a7ba0e548	Fix memleak when DB::DeleteFile() Summary: Because the corresponding read_first_record_cache_ item wasn't erased, memory leaked. Closes https://github.com/facebook/rocksdb/pull/1712 Differential Revision: D4363654 Pulled By: ajkr fbshipit-source-id: 7da1adcfc8c380e4ffe05b8769fc2221ad17a225	2018-01-11 18:57:33 -08:00
Bo Liu	204af1eccc	add WriteBatch::WriteBatch(std::string&&) Summary: to save a string copy for some use cases. The change is pretty straightforward, please feel free to let me know if you want to suggest any tests for it. Closes https://github.com/facebook/rocksdb/pull/3349 Differential Revision: D6706828 Pulled By: yiwu-arbug fbshipit-source-id: 873ce4442937bdc030b395c7f99228eda7f59eb7	2018-01-11 15:43:56 -08:00
Andrew Kryczka	0c6e8be9e2	Fix directory name for db_basic_test Summary: It was using the same directory as `db_options_test` so transiently failed when unit tests were run in parallel. Closes https://github.com/facebook/rocksdb/pull/3352 Differential Revision: D6691649 Pulled By: ajkr fbshipit-source-id: bee433484fec4faedd5cadf2db3c92fdcc99a170	2018-01-10 15:41:46 -08:00
Siying Dong	6aa95f4d0f	Fix a wrong log formatting Summary: I experienced weird segfault because of this mismatch of type in log formatting. Fix it. Closes https://github.com/facebook/rocksdb/pull/3345 Differential Revision: D6687224 Pulled By: siying fbshipit-source-id: c51fb1c008b7ebc3efdc353a4adad3e8f5b3e9de	2018-01-09 14:58:33 -08:00
Andrew Kryczka	0f0d2ab95a	fix DBImpl instance variable naming Summary: got confused while reading `FindObsoleteFiles` due to thinking it's a local variable, so renamed it properly Closes https://github.com/facebook/rocksdb/pull/3342 Differential Revision: D6684797 Pulled By: ajkr fbshipit-source-id: a4df0aae1cccce99d4dd4d164aadc85b17707132	2018-01-09 12:56:58 -08:00
Chris Lu	24e2c1640d	add support for allow_ingest_behind in C API Summary: https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files Need to expose these functions in the C API to be used by Go bindings. Closes https://github.com/facebook/rocksdb/pull/3011 Differential Revision: D6679563 Pulled By: sagar0 fbshipit-source-id: 536f844ddaeb0172c6d7e416d2a75e8f9e57c8ef	2018-01-08 17:26:31 -08:00
Andrew Kryczka	f00e176c5b	fix ForwardIterator reference to temporary object Summary: Fixes the following ASAN error: ``` ==2108042==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fc50ae9b868 at pc 0x7fc5112aff55 bp 0x7fff9eb9dc10 sp 0x7fff9eb9dc08 === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN === READ of size 8 at 0x7fc50ae9b868 thread T0 SCARINESS: 23 (8-byte-read-stack-use-after-scope) #0 rocksdb/dbformat.h:164 rocksdb::InternalKeyComparator::user_comparator() const #1 librocksdb_src_rocksdb_lib.so+0x1429a7d rocksdb::RangeDelAggregator::InitRep(std::vector<...> const&) #2 librocksdb_src_rocksdb_lib.so+0x142ceae rocksdb::RangeDelAggregator::AddTombstones(std::unique_ptr<...>) #3 librocksdb_src_rocksdb_lib.so+0x1382d88 rocksdb::ForwardIterator::RebuildIterators(bool) #4 librocksdb_src_rocksdb_lib.so+0x1382362 rocksdb::ForwardIterator::ForwardIterator(rocksdb::DBImpl, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData, rocksdb::SuperVersion) #5 librocksdb_src_rocksdb_lib.so+0x11f433f rocksdb::DBImpl::NewIterator(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle) #6 rocksdb/src/include/rocksdb/db.h:382 rocksdb::DB::NewIterator(rocksdb::ReadOptions const&) #7 rocksdb/db_range_del_test.cc:807 rocksdb::DBRangeDelTest_TailingIteratorRangeTombstoneUnsupported_Test::TestBody() #18 rocksdb/db_range_del_test.cc:1006 main Address 0x7fc50ae9b868 is located in stack of thread T0 at offset 104 in frame #0 librocksdb_src_rocksdb_lib.so+0x13825af rocksdb::ForwardIterator::RebuildIterators(bool) ``` Closes https://github.com/facebook/rocksdb/pull/3300 Differential Revision: D6612989 Pulled By: ajkr fbshipit-source-id: e7ea2ed914c1b80a8a29d71d92440a6bd9cbcc80	2017-12-20 16:12:04 -08:00
Maysam Yabandeh	0ef3fdd732	Disable need_log_sync on bg err Summary: When there is a background error PreprocessWrite returns without marking the logs synced. If we keep need_log_sync to true, it would try to sync them at the end, which would break the logic. The patch would unset need_log_sync if the logs end up not being marked for sync in PreprocessWrite. Closes https://github.com/facebook/rocksdb/pull/3293 Differential Revision: D6602347 Pulled By: maysamyabandeh fbshipit-source-id: 37ee04209e8dcfd78de891654ce50d0954abeb38	2017-12-20 08:12:24 -08:00
Yi Wu	06149429d9	WritePrepared Txn: Return NotSupported on iterator refresh Summary: A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported. Closes https://github.com/facebook/rocksdb/pull/3290 Differential Revision: D6599931 Pulled By: yiwu-arbug fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf	2017-12-18 22:29:30 -08:00
Maysam Yabandeh	78c2eedb4f	fix release order in validateNumberOfEntries Summary: ScopedArenaIterator should be defined after range_del_agg so that it destructs the assigned iterator, which depends on range_del_agg, before it range_del_agg is already destructed. Closes https://github.com/facebook/rocksdb/pull/3281 Differential Revision: D6592332 Pulled By: maysamyabandeh fbshipit-source-id: 89a15d8ed13d0fc856b0c47dce3d91778738dbac	2017-12-18 14:27:28 -08:00
Anand Ananthabhotla	fccc12f386	Add a histogram stat for memtable flush Summary: Add a new histogram stat called rocksdb.db.flush.micros for memtable flush Closes https://github.com/facebook/rocksdb/pull/3269 Differential Revision: D6559496 Pulled By: anand1976 fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d	2017-12-15 18:57:00 -08:00
Yi Wu	237b292515	BlobDB: Remove the need to get sequence number per write Summary: Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence). Closes https://github.com/facebook/rocksdb/pull/3274 Differential Revision: D6571497 Pulled By: yiwu-arbug fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909	2017-12-15 13:27:30 -08:00
Andrew Kryczka	5a7e08468a	fix ThreadStatus for bottom-pri compaction threads Summary: added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`. Closes https://github.com/facebook/rocksdb/pull/3270 Differential Revision: D6559428 Pulled By: ajkr fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c	2017-12-14 14:57:49 -08:00
Siying Dong	def6a00740	Print out compression type of new SST files in logging Summary: Closes https://github.com/facebook/rocksdb/pull/3264 Differential Revision: D6552768 Pulled By: siying fbshipit-source-id: 6303110aff22f341d5cff41f8d2d4f138a53652d	2017-12-14 10:27:43 -08:00
Maysam Yabandeh	546a63272f	disableWAL with WriteImplWALOnly Summary: Currently WriteImplWALOnly simply returns when disableWAL is set. This is an incorrect behavior since it does not allocated the sequence number, which is a side-effect of writing to the WAL. This patch fixes the issue. Closes https://github.com/facebook/rocksdb/pull/3262 Differential Revision: D6550974 Pulled By: maysamyabandeh fbshipit-source-id: 745a83ae8f04e7ca6c8ffb247d6ef16c287c52e7	2017-12-13 07:57:44 -08:00
Zhongyi Xie	51c2ea0feb	Reduce heavy hitter for Get operation Summary: This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get` - rocksdb.block.cache.bytes.write - rocksdb.block.cache.add - rocksdb.block.cache.data.miss - rocksdb.block.cache.data.bytes.insert - rocksdb.block.cache.data.add - rocksdb.block.cache.hit - rocksdb.block.cache.data.hit - rocksdb.block.cache.bytes.read The db_bench statistics before and after the change are: \|1GB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|4.22% \|1.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.51% \|0.21% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.14% \|0.14% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| \|1MB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|3.48% \|1.08% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.80% \|0.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.35% \|0.35% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| Closes https://github.com/facebook/rocksdb/pull/3172 Differential Revision: D6330532 Pulled By: miasantreble fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741	2017-12-12 21:11:33 -08:00
Islam AbdelRahman	9089373a01	Fix DeleteScheduler::MarkAsTrash() handling existing trash Summary: DeleteScheduler::MarkAsTrash() don't handle existing .trash files correctly This cause rocksdb to not being able to delete existing .trash files on restart Closes https://github.com/facebook/rocksdb/pull/3261 Differential Revision: D6548003 Pulled By: IslamAbdelRahman fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6	2017-12-12 18:17:13 -08:00
Yi Wu	e3a06f12d2	WritePrepared Txn: fix compaction filter snapshot checks Summary: Add snapshot_checker check whenever we need to check sequence against snapshots and decide what to do with an input key. The changes are related to one of: * compaction filter * single delete * delete at bottom level * merge Closes https://github.com/facebook/rocksdb/pull/3251 Differential Revision: D6537850 Pulled By: yiwu-arbug fbshipit-source-id: 3faba40ed5e37779f4a0cb7ae78af9546659c7f2	2017-12-12 11:12:24 -08:00
Zhongyi Xie	bb5ed4b1d1	exclude DynamicUniversalCompactionOptions from ROCKSDB_LITE Summary: since [SetOptions](https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L494) is not supported in ROCKSDB_LITE Right now unit test under lite is broken Closes https://github.com/facebook/rocksdb/pull/3253 Differential Revision: D6539428 Pulled By: miasantreble fbshipit-source-id: 13172b8ecbd75682330726498ea198969bc3e637	2017-12-11 16:28:20 -08:00
Yi Wu	9a27ac5d89	Fix drop column family data race Summary: A data race is caught by tsan_crash test between compaction and DropColumnFamily: https://gist.github.com/yiwu-arbug/5a2b4baae05eeb99ae1719b650f30a44 Compaction checks if the column family has been dropped on each key input, while user can issue DropColumnFamily which updates cfd->dropped_, causing the data race. Fixing it by making cfd->dropped_ an atomic. Closes https://github.com/facebook/rocksdb/pull/3250 Differential Revision: D6535991 Pulled By: yiwu-arbug fbshipit-source-id: 5571df020beae7fa7db6fff5ad0d598f49962895	2017-12-11 13:57:48 -08:00
Zhongyi Xie	fcc8a6574d	Make Universal compaction options dynamic Summary: Let me know if more test coverage is needed Closes https://github.com/facebook/rocksdb/pull/3213 Differential Revision: D6457165 Pulled By: miasantreble fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9	2017-12-11 13:27:06 -08:00
Prashant D	6a183d1ae8	Fix coverity issues compaction_job, compaction_picker Summary: db/compaction_job.cc: ReportStartedCompaction(compaction); CID 1419863 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 2. uninit_member: Non-static class member bottommost_level_ is not initialized in this constructor nor in any functions that it calls. db/compaction_picker_universal.cc: 7struct InputFileInfo { 2. uninit_member: Non-static class member level is not initialized in this constructor nor in any functions that it calls. CID 1405355 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 4. uninit_member: Non-static class member index is not initialized in this constructor nor in any functions that it calls. 38 InputFileInfo() : f(nullptr) {} db/dbformat.h: ParsedInternalKey() 84 : sequence(kMaxSequenceNumber) // Make code analyzer happy CID 1168095 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 2. uninit_member: Non-static class member type is not initialized in this constructor nor in any functions that it calls. 85 {} // Intentionally left uninitialized (for speed) Closes https://github.com/facebook/rocksdb/pull/3091 Differential Revision: D6534558 Pulled By: yiwu-arbug fbshipit-source-id: 5ada975956196d267b3f149386842af71eda7553	2017-12-11 11:57:15 -08:00
Prashant D	34aa245dd8	Fix coverity issues version, write_batch Summary: db/version_builder.cc: 117 base_vstorage_->InternalComparator(); CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls. db/version_edit.h: 145 FdWithKeyRange() 146 : fd(), 147 smallest_key(), 148 largest_key() { CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls. 149 } db/version_set.cc: 120 } CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls. 121 } db/write_batch.cc: 939 assert(cf_mems_); CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls. 940 } Closes https://github.com/facebook/rocksdb/pull/3092 Differential Revision: D6505666 Pulled By: yiwu-arbug fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86	2017-12-07 11:57:36 -08:00
Andrew Kryczka	2e3a00987e	fix ASAN for DeleteFilesInRange test case Summary: error message was ``` ==3095==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd18216c40 at pc 0x0000005edda1 bp 0x7ffd18215550 sp 0x7ffd18214d00 ... Address 0x7ffd18216c40 is located in stack of thread T0 at offset 1952 in frame #0 internal_repo_rocksdb/db_compaction_test.cc:1520 rocksdb::DBCompactionTest_DeleteFileRangeFileEndpointsOverlapBug_Test::TestBody() ``` It was unsafe to have slices referring to the temporary string objects' buffers, as those strings were destroyed before the slices were used. Fixed it by assigning the strings returned by `Key()` to local variables. Closes https://github.com/facebook/rocksdb/pull/3238 Differential Revision: D6507864 Pulled By: ajkr fbshipit-source-id: dd07de1a0070c6748c1ab4f3d7bd31f9a81889d0	2017-12-07 11:12:43 -08:00
Sagar Vemuri	bbef8c3884	Log GetCurrentTime failures during Flush and Compaction Summary: `GetCurrentTime()` is used to populate `creation_time` table property during flushes and compactions. It is safe to ignore `GetCurrentTime()` failures here but they should be logged. (Note that `creation_time` property was introduced as part of TTL-based FIFO compaction in #2480.) Tes Plan: `make check` Closes https://github.com/facebook/rocksdb/pull/3231 Differential Revision: D6501935 Pulled By: sagar0 fbshipit-source-id: 376adcf4ab801d3a43ec4453894b9a10909c8eb6	2017-12-06 20:56:53 -08:00
Andrew Kryczka	78d1a5ec72	Preserve overlapping file endpoint invariant Summary: Fix for #2833. - In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete. - In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`. - In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`). Closes https://github.com/facebook/rocksdb/pull/2843 Differential Revision: D5772387 Pulled By: ajkr fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881	2017-12-06 18:56:54 -08:00
Yi Wu	a7d32776f0	Fix write_callback_test compile error Summary: Rename shadow variable name db_impl. Fixing #3227 Closes https://github.com/facebook/rocksdb/pull/3235 Differential Revision: D6504051 Pulled By: yiwu-arbug fbshipit-source-id: 186c9378dabb11f8d6db56f45c95cc3b029fcb88	2017-12-06 17:12:27 -08:00
Yi Wu	20995c5729	Make iterator invalid on Merge error Summary: Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3226 Differential Revision: D6499094 Pulled By: yiwu-arbug fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c	2017-12-06 11:56:39 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Maysam Yabandeh	18dcf7f98d	WritePrepared Txn: PreReleaseCallback Summary: Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue. These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number. Closes https://github.com/facebook/rocksdb/pull/3205 Differential Revision: D6438959 Pulled By: maysamyabandeh fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9	2017-11-30 23:50:45 -08:00
zhangjinpeng1987	ffacaaa3ea	fix Seek with lower_bound Summary: When Seek a key less than `lower_bound`, should return `lower_bound`. ajkr PTAL Closes https://github.com/facebook/rocksdb/pull/3199 Differential Revision: D6421126 Pulled By: ajkr fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7	2017-11-29 22:56:29 -08:00
kapitan-k	75d57a5d53	C API: Add some block based table options Summary: Closes https://github.com/facebook/rocksdb/pull/3159 Differential Revision: D6428220 Pulled By: sagar0 fbshipit-source-id: 60508d09b5281f54b907a1c40e9631fc08343131	2017-11-28 14:12:44 -08:00
Yi Wu	3cf562be31	Fix IOError on WAL write doesn't propagate to write group follower Summary: This is a simpler version of #3097 by removing all unrelated changes. Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError. Closes https://github.com/facebook/rocksdb/pull/3201 Differential Revision: D6421644 Pulled By: yiwu-arbug fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e	2017-11-28 11:42:48 -08:00
Andrew Kryczka	1bdb44de95	optimize file ingestion checks for range deletion overlap Summary: Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened. Closes https://github.com/facebook/rocksdb/pull/3179 Differential Revision: D6358125 Pulled By: ajkr fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279	2017-11-28 11:27:02 -08:00
Griffin Smith	2f09524762	Expose all remaining read and write options via the C API Summary: Expose read and write options via the C API Closes https://github.com/facebook/rocksdb/pull/3185 Differential Revision: D6389658 Pulled By: sagar0 fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472	2017-11-28 10:28:46 -08:00
Maysam Yabandeh	e59cb2a19b	Add seq_per_batch to WriteWithCallbackTest Summary: Augment WriteWithCallbackTest to also test when seq_per_batch is true. Closes https://github.com/facebook/rocksdb/pull/3195 Differential Revision: D6398143 Pulled By: maysamyabandeh fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3	2017-11-22 13:56:44 -08:00
Zhongyi Xie	5fac4729cc	make compaction_readahead_size_ thread safe Summary: this should fix the failing tsan_check Closes https://github.com/facebook/rocksdb/pull/3192 Differential Revision: D6390004 Pulled By: miasantreble fbshipit-source-id: 6cadfc6f68febb1a77b0abcdb5416570dad926a5	2017-11-21 20:11:38 -08:00
anand1976	d394a6bb48	Add a ticker stat for number of keys skipped during iteration Summary: This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the number of internal keys skipped during iteration. Keys can be skipped due to deletes, or lower sequence number, or higher sequence number than the one requested. Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary, padding becomes a zero size array, which the Windows compiler doesn't like. So add a cacheline worth of padding in that case to keep it happy. We cannot conditionally add padding as gcc doesn't allow using sizeof in preprocessor directives. Closes https://github.com/facebook/rocksdb/pull/3177 Differential Revision: D6353897 Pulled By: anand1976 fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2	2017-11-20 21:26:37 -08:00
Gustav Davidsson	2d04ed65e4	Make trash-to-DB size ratio limit configurable Summary: Allow users to configure the trash-to-DB size ratio limit, so that ratelimits for deletes can be enforced even when larger portions of the database are being deleted. Closes https://github.com/facebook/rocksdb/pull/3158 Differential Revision: D6304897 Pulled By: gdavidsson fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3	2017-11-17 11:58:17 -08:00
Zhongyi Xie	32e31d49d1	Make DBOption compaction_readahead_size dynamic Summary: Closes https://github.com/facebook/rocksdb/pull/3004 Differential Revision: D6056141 Pulled By: miasantreble fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f	2017-11-16 17:57:25 -08:00
Maysam Yabandeh	54b43563be	WritePrepared Txn: Refactoring WriteCallback Summary: Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number and making sure it is consistent across the code. Closes https://github.com/facebook/rocksdb/pull/3168 Differential Revision: D6324312 Pulled By: maysamyabandeh fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33	2017-11-15 08:27:06 -08:00
Maysam Yabandeh	53863b76f9	WritePrepared Txn: fix bug with Rollback seq Summary: The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it. Closes https://github.com/facebook/rocksdb/pull/3157 Differential Revision: D6304291 Pulled By: maysamyabandeh fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401	2017-11-15 08:27:06 -08:00
Maysam Yabandeh	175d5d6a9e	Properly destruct rebuilding_trx_ Summary: When testing rebuilding_trx_ in MemTableInserter might still be set before the tests finishes which would cause ASAN alarms for leaks. This patch deletes the pointers in MemTableInserter destructor. Closes https://github.com/facebook/rocksdb/pull/3162 Differential Revision: D6317113 Pulled By: maysamyabandeh fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21	2017-11-14 08:56:50 -08:00
Maysam Yabandeh	2515266725	WritePrepared Txn: Refactoring TrackKeys Summary: This patch clarifies and refactors the logic around tracked keys in transactions. Closes https://github.com/facebook/rocksdb/pull/3140 Differential Revision: D6290258 Pulled By: maysamyabandeh fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1	2017-11-11 13:14:20 -08:00
Maysam Yabandeh	2edc92bc28	WritePrepared Txn: cross-compatibility test Summary: Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables Closes https://github.com/facebook/rocksdb/pull/3118 Differential Revision: D6227247 Pulled By: maysamyabandeh fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7	2017-11-11 11:28:37 -08:00
Maysam Yabandeh	857adf388f	WritePrepared Txn: Refactor conf params Summary: Summary of changes: - Move seq_per_batch out of Options - Rename concurrent_prepare to two_write_queues - Add allocate_seq_only_for_data_ Closes https://github.com/facebook/rocksdb/pull/3136 Differential Revision: D6304458 Pulled By: maysamyabandeh fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c	2017-11-10 17:28:12 -08:00
Dmitri Smirnov	f8e2db0717	Fix crashes, address test issues and adjust windows test script Summary: Add per-exe execution capability Add fix parsing of groups/tests Add timer test exclusion Fix unit tests Ifdef threadpool specific tests that do not pass on Vista threadpool. Remove spurious outout from prefix_test so test case listing works properly. Fix not using standard test directories results in file creation errors in sst_dump_test. BlobDb fixes: In C++ end() iterators can not be dereferenced. They are not valid. When deleting blob_db_ set it to nullptr before any other code executes. Not fixed:. On Windows you can not delete a file while it is open. [ RUN ] BlobDBTest.ReadWhileGC d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options) IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options) IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied write_batch Should not call front() if there is a chance the container is empty Closes https://github.com/facebook/rocksdb/pull/3152 Differential Revision: D6293274 Pulled By: sagar0 fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956	2017-11-10 10:41:57 -08:00
Shaohua Li	eefd75a228	Stream Summary: Add a simple policy for NVMe write time life hint Closes https://github.com/facebook/rocksdb/pull/3095 Differential Revision: D6298030 Pulled By: shligit fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d	2017-11-10 09:26:24 -08:00
kapitan-k	f1c5eaba56	updated c ingestexternalfileoptions for ingest behind Summary: Closes https://github.com/facebook/rocksdb/pull/3151 Differential Revision: D6293861 Pulled By: ajkr fbshipit-source-id: f8db0a71509d1cd8237f2d377bf9e1bb0464bdbf	2017-11-09 18:15:09 -08:00
Andrew Kryczka	93f69cb93a	use bottommost compression when base level is bottommost Summary: The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen: - full compaction generates files with `bottommost_compression` type - now base level is bottommost level since all files are in the same level - any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level I changed the code to make bottommost compression apply to base level. Closes https://github.com/facebook/rocksdb/pull/3141 Differential Revision: D6264614 Pulled By: ajkr fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82	2017-11-09 17:42:00 -08:00
Sagar Vemuri	a6d8e30c05	Remove unnecessary status check in TableCache::NewIterator Summary: While investigating the usage of `new_table_iterator_nanos` perf counter, I saw some code was wrapper around with unnecessary status check ... so removed it. Closes https://github.com/facebook/rocksdb/pull/3120 Differential Revision: D6229181 Pulled By: sagar0 fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34	2017-11-03 14:42:08 -07:00
Andrew Kryczka	24ad430600	pass key/value samples through zstd compression dictionary generator Summary: Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before. Note this is the first step of #2987, extracted into a separate PR per reviewer request. Closes https://github.com/facebook/rocksdb/pull/3057 Differential Revision: D6116891 Pulled By: ajkr fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497	2017-11-02 22:56:36 -07:00
Andrew Kryczka	c4c1f961e7	dynamically change current memtable size Summary: Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM. This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable. Closes https://github.com/facebook/rocksdb/pull/3119 Differential Revision: D6228304 Pulled By: ajkr fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9	2017-11-02 22:28:10 -07:00
Yi Wu	62578d80c1	Blob DB: Add compaction filter to remove expired blob index entries Summary: After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries. Closes https://github.com/facebook/rocksdb/pull/3090 Differential Revision: D6183812 Pulled By: yiwu-arbug fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83	2017-11-02 17:27:38 -07:00
Yi Wu	7bfa88037e	Blob DB: fix snapshot handling Summary: Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is 1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish 2. otherwise, it is the latest sequence of the file Closes https://github.com/facebook/rocksdb/pull/3087 Differential Revision: D6182519 Pulled By: yiwu-arbug fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45	2017-11-02 15:58:27 -07:00
Andrew Kryczka	6778690b51	fix duplicate definition of GetEntryType() Summary: It's also defined in db/dbformat.cc per `7fe3b32896` Closes https://github.com/facebook/rocksdb/pull/3111 Differential Revision: D6219140 Pulled By: ajkr fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35	2017-11-01 22:56:17 -07:00
Maysam Yabandeh	02693f64fc	WritePrepared Txn: ValidateSnapshot Summary: Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function. Closes https://github.com/facebook/rocksdb/pull/3101 Differential Revision: D6199405 Pulled By: maysamyabandeh fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240	2017-11-01 19:11:09 -07:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Maysam Yabandeh	17731a43a6	WritePrepared Txn: Optimize for recoverable state Summary: GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit. This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush. Closes https://github.com/facebook/rocksdb/pull/3071 Differential Revision: D6148023 Pulled By: maysamyabandeh fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7	2017-11-01 17:26:46 -07:00
Shaohua Li	33c7d4ccd9	Make writable_file_max_buffer_size dynamic Summary: The DBOptions::writable_file_max_buffer_size can be changed dynamically. Closes https://github.com/facebook/rocksdb/pull/3053 Differential Revision: D6152720 Pulled By: shligit fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681	2017-10-31 13:56:35 -07:00
Andrew Kryczka	b7bc9cc038	fix tracking oldest snapshot for bottom-level compaction Summary: The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case. Closes https://github.com/facebook/rocksdb/pull/3080 Differential Revision: D6185257 Pulled By: ajkr fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd	2017-10-30 00:55:58 -07:00
Yi Wu	792ef10ca8	Return Status::InvalidArgument if user request sync write while disabling WAL Summary: write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options. Closes https://github.com/facebook/rocksdb/pull/3086 Differential Revision: D6176822 Pulled By: yiwu-arbug fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071	2017-10-28 22:11:18 -07:00
Andrew Kryczka	6a9335dbbb	always drop tombstones compacted to bottommost level Summary: Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels. Closes https://github.com/facebook/rocksdb/pull/3085 Differential Revision: D6175742 Pulled By: ajkr fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b	2017-10-27 15:56:35 -07:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Islam AbdelRahman	05993155ef	Mark files as trash by using .trash extension Summary: SstFileManager move files that need to be deleted into a trash directory. Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted Closes https://github.com/facebook/rocksdb/pull/2970 Differential Revision: D5976805 Pulled By: IslamAbdelRahman fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343	2017-10-27 13:27:12 -07:00
Prashant D	50e95a63dd	Fix coverity issues column_family, compaction_db/iterator Summary: db/column_family.h : 79 ColumnFamilyHandleInternal() CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls. 80 : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {} db/compacted_db_impl.cc: 18CompactedDBImpl::CompactedDBImpl( 19 const DBOptions& options, const std::string& dbname) 20 : DBImpl(options, dbname) { 2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls. 4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls. CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls. 21} db/compaction_iterator.cc: 9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls. 11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls. CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls. Closes https://github.com/facebook/rocksdb/pull/3084 Differential Revision: D6172999 Pulled By: sagar0 fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460	2017-10-27 11:26:42 -07:00

... 6 7 8 9 10 ...

3694 Commits