rocksdb

Author	SHA1	Message	Date
fangchenliaohui	b77007df8b	Bug: parallel_group status updated in WriteThread::CompleteParallelWorker Summary: Multiple write threads may update the status of the parallel_group in WriteThread::CompleteParallelWorker if the status of a Writer is not ok. When copying the write status to the parallel_group, the write thread holds only the mutex of the writer it processed itself, which is useless. The thread should hold the mutex of the leader of the parallel_group instead. Closes https://github.com/facebook/rocksdb/pull/1598 Differential Revision: D4252335 Pulled By: siying fbshipit-source-id: 3864cf7	2016-12-01 09:54:11 -08:00
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	2016-11-30 11:24:14 -08:00
zhangjinpeng1987	00197cff39	Add C API to set base_background_compactions Summary: Add C API to set base_background_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	2016-11-30 11:09:13 -08:00
Andrew Kryczka	5b219eccb5	deleterange end-to-end test improvements for lite/robustness Summary: Closes https://github.com/facebook/rocksdb/pull/1591 Differential Revision: D4246019 Pulled By: ajkr fbshipit-source-id: 0c4aa37	2016-11-29 12:24:13 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Siying Dong	7784980fcd	Fix mis-reporting of compaction read bytes to the base level Summary: In dynamic leveled compaction, when calculating read bytes, output level bytes may be wrongly calculated as input level inputs. Fix it. Closes https://github.com/facebook/rocksdb/pull/1475 Differential Revision: D4148412 Pulled By: siying fbshipit-source-id: f2f475a	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	3c6b49ed66	Fix implicit conversion between int64_t to int Summary: Make conversion explicit, implicit conversion breaks the build Closes https://github.com/facebook/rocksdb/pull/1589 Differential Revision: D4245158 Pulled By: IslamAbdelRahman fbshipit-source-id: aaec00d	2016-11-29 10:54:15 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Andrew Kryczka	4f6e89b1d0	Fix range deletion covering key in same SST file Summary: AddTombstones() needs to be before t->Get(), oops :'( Closes https://github.com/facebook/rocksdb/pull/1576 Differential Revision: D4241041 Pulled By: ajkr fbshipit-source-id: 781ceea	2016-11-28 22:54:13 -08:00
Islam AbdelRahman	a2bf265a39	Avoid intentional overflow in GetL0ThresholdSpeedupCompaction Summary: `99c052a34f` fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking whether the int becomes negative. UBSAN still complains about that, since it is still an overflow; we can fix the issue by simply using int64_t. Closes https://github.com/facebook/rocksdb/pull/1582 Differential Revision: D4241525 Pulled By: IslamAbdelRahman fbshipit-source-id: b3ae21f	2016-11-28 18:39:13 -08:00
Islam AbdelRahman	52fd1ff2c2	disable UBSAN for functions with intentional -ve shift / overflow Summary: disable UBSAN for functions with intentional left shift on -ve number / overflow. These functions are rocksdb::Hash, FixedLengthColBufEncoder::Append, and FaultInjectionTest::Key. Closes https://github.com/facebook/rocksdb/pull/1577 Differential Revision: D4240801 Pulled By: IslamAbdelRahman fbshipit-source-id: 3e1caf6	2016-11-28 17:54:12 -08:00
Islam AbdelRahman	1886c435b9	Fix CompactionJob::Install division by zero Summary: Fix CompactionJob::Install division by zero Closes https://github.com/facebook/rocksdb/pull/1580 Differential Revision: D4240794 Pulled By: IslamAbdelRahman fbshipit-source-id: 7286721	2016-11-28 16:54:16 -08:00
Islam AbdelRahman	13e66a8f51	Fix compaction_job.cc division by zero Summary: Fix division by zero in compaction_job.cc Closes https://github.com/facebook/rocksdb/pull/1575 Differential Revision: D4240818 Pulled By: IslamAbdelRahman fbshipit-source-id: a8bc757	2016-11-28 16:39:13 -08:00
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: The current write stalling system lacks positive feedback when the restricted rate is already too low, so users sometimes get stuck at a very low slowdown value. With this diff, we add positive feedback (increasing the slowdown value) when we recover from the slowdown state back to normal. To keep the positive feedback from pushing the slowdown value too high, we issue negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase the level0_stop_writes_trigger default from 24 to 32. Since the level0_slowdown_writes_trigger default is 20, a stop trigger of 24 gives only four files of buffer in which to slow down writes. To avoid stopping within four files after 20 files have accumulated, the slowdown value must be very low, which is almost the same as a stop. It also doesn't give the slowdown value enough time to converge. Increasing it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with the existing interface. Compared to the previous version, this approach works better for inserting keys in decreasing order or updating the same key, and imposes fewer restrictions on the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Andrew Kryczka	734e4acafb	Eliminate redundant cache lookup with range deletion Summary: When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader. Closes https://github.com/facebook/rocksdb/pull/1537 Differential Revision: D4201167 Pulled By: ajkr fbshipit-source-id: d33ffd8	2016-11-21 21:24:11 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Andrew Kryczka	635a7bd1ad	refactor TableCache Get/NewIterator for single exit points Summary: these functions were too complicated to change with exit points everywhere, so refactored them. btw, please review urgently, this is a prereq to fix the 5.0 perf regression Closes https://github.com/facebook/rocksdb/pull/1534 Differential Revision: D4198972 Pulled By: ajkr fbshipit-source-id: 04ebfb7	2016-11-17 14:39:13 -08:00
Siying Dong	a4eb7387b2	Allow plain table to store index on file with bloom filter disabled Summary: Currently plain table bloom filter is required if storing metadata on file. Remove the constraint. Closes https://github.com/facebook/rocksdb/pull/1525 Differential Revision: D4190977 Pulled By: siying fbshipit-source-id: be60442	2016-11-17 11:09:13 -08:00
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is inefficient; * It doesn't make sense to have it as a ticker count if multiple db instances share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	2016-11-16 22:39:09 -08:00
Andrew Kryczka	760ef68a69	fix deleterange asan issue Summary: pinned_iters_mgr_ pins iterators allocated with arena_, so we should order the instance variable declarations such that the pinned iterators have their destructors executed before the arena is destroyed. Closes https://github.com/facebook/rocksdb/pull/1528 Differential Revision: D4191984 Pulled By: ajkr fbshipit-source-id: 1386f20	2016-11-16 14:09:07 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Yi Wu	1543d5d92e	Report memory usage by memtable insert hints map. Summary: It is hard to measure actual memory usage by std containers. Even providing a custom allocator will miscount some of the usage. Here we only make a wild guess at its memory usage. Closes https://github.com/facebook/rocksdb/pull/1511 Differential Revision: D4179945 Pulled By: yiwu-arbug fbshipit-source-id: 32ab929	2016-11-15 20:24:13 -08:00
Andrew Kryczka	48e8baebc0	Decouple data iterator and range deletion iterator in TableCache Summary: Previously we used TableCache::NewIterator() for multiple purposes (data block iterator and range deletion iterator), and returned non-ok status in the data block iterator. In one case where the caller only used the range deletion block iterator (`9e7cf3469b/db/version_set.cc (L965-L973)`), we didn't check/free the data block iterator containing non-ok status, which caused a valgrind error. So, this diff decouples creation of data block and range deletion block iterators, and updates the callers accordingly. Both functions can return non-ok status in an InternalIterator. Since the non-ok status is returned in an iterator that the callers will definitely use, it should be more usable/less error-prone. Closes https://github.com/facebook/rocksdb/pull/1513 Differential Revision: D4181423 Pulled By: ajkr fbshipit-source-id: 835b8f5	2016-11-15 17:24:28 -08:00
Andrew Kryczka	661e4c9267	DeleteRange unsupported in non-block-based tables Summary: Return an error from DeleteRange() (or Write() if the user is using the low-level WriteBatch API) if an unsupported table type is configured. Closes https://github.com/facebook/rocksdb/pull/1519 Differential Revision: D4185933 Pulled By: ajkr fbshipit-source-id: abcdf84	2016-11-15 15:24:16 -08:00
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	2016-11-15 15:24:16 -08:00
Islam AbdelRahman	eba99c28e4	Fix min_write_buffer_number_to_merge = 0 bug Summary: It's possible that we set min_write_buffer_number_to_merge to 0. This should never happen Closes https://github.com/facebook/rocksdb/pull/1515 Differential Revision: D4183356 Pulled By: yiwu-arbug fbshipit-source-id: c9d39d7	2016-11-15 13:54:08 -08:00
Artemiy Kolesnikov	91300d01f6	Dynamic max_total_wal_size option Summary: Closes https://github.com/facebook/rocksdb/pull/1509 Differential Revision: D4176426 Pulled By: yiwu-arbug fbshipit-source-id: b57689d	2016-11-14 22:54:17 -08:00
Andrew Kryczka	ec2f64794b	Consider subcompaction boundaries when updating file boundaries for range deletion Summary: Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound considers both subcompaction boundaries and previous/next files within the subcompaction. Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after. Closes https://github.com/facebook/rocksdb/pull/1501 Reviewed By: yhchiang Differential Revision: D4171685 Pulled By: ajkr fbshipit-source-id: ee99db8	2016-11-14 20:24:21 -08:00
Andrew Kryczka	3b192f6186	Handle full final subcompaction output file with range deletions Summary: This conditional should only open a new file that's dedicated to range deletions when it's the sole output of the subcompaction. Previously, we created such a file whenever the table builder was nullptr, which would've also been the case whenever the CompactionIterator's final key coincided with the final output table becoming full. Closes https://github.com/facebook/rocksdb/pull/1507 Differential Revision: D4174613 Pulled By: ajkr fbshipit-source-id: 9ffacea	2016-11-14 17:54:20 -08:00
Andrew Kryczka	6c57952002	Make range deletion inclusive-exclusive Summary: This makes it easier to implement future optimizations like range collapsing. Closes https://github.com/facebook/rocksdb/pull/1504 Differential Revision: D4172214 Pulled By: ajkr fbshipit-source-id: ac4942f	2016-11-14 17:39:13 -08:00
Yi Wu	1ea79a78c9	Optimize sequential insert into memtable - Part 1: Interface Summary: Currently our skip-list has an optimization to speed up sequential inserts from a single stream, by remembering the last insert position. We extend the idea to support sequential inserts from multiple streams, and even tolerate small reordering within each stream. This PR is the interface part, adding the following: - Add `memtable_insert_prefix_extractor` to allow specifying a prefix for each key. - Add `InsertWithHint()` interface to memtable, to allow the underlying implementation to return a hint of the insert position, which can later be passed back to optimize inserts. - Memtable will maintain a map from prefix to hints and pass the hint via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null. Closes https://github.com/facebook/rocksdb/pull/1419 Differential Revision: D4079367 Pulled By: yiwu-arbug fbshipit-source-id: 3555326	2016-11-13 19:09:18 -08:00
Yi Wu	df5eeb85ca	Optimize sequential insert into memtable - Part 2: Implementation Summary: Implement an insert hint in the skip-list to hint the insert position. This is to optimize for write workloads where there are multiple streams of sequential writes. For example, there is a stream of keys a1, a2, a3... but also b1, b2, b3... Each stream is not necessarily strictly sequential, but can get reordered a little bit. The user can specify a prefix extractor, and the `SkipListRep` can thus maintain a hint for each stream for fast insert into the memtable. This is the internal implementation part. See #1419 for the interface part. See inline comments for details. Closes https://github.com/facebook/rocksdb/pull/1449 Differential Revision: D4106781 Pulled By: yiwu-arbug fbshipit-source-id: f4d48c4	2016-11-13 13:09:16 -08:00
Islam AbdelRahman	5ed650857d	Fix SstFileWriter destructor Summary: If the user did not call SstFileWriter::Finish(), or called Finish() but it failed, we need to abandon the builder to avoid destructing it while it's open. Closes https://github.com/facebook/rocksdb/pull/1502 Differential Revision: D4171660 Pulled By: IslamAbdelRahman fbshipit-source-id: ab6f434	2016-11-12 20:11:19 -08:00
Lijun Tang	adb665e0bf	Allowed delayed_write_rate option to be dynamically set. Summary: Closes https://github.com/facebook/rocksdb/pull/1488 Differential Revision: D4157784 Pulled By: siying fbshipit-source-id: f150081	2016-11-12 15:54:11 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Aaron Gao	b39b2ee12f	do not call get() in recovery mode Summary: This is a previous fix that has a typo Closes https://github.com/facebook/rocksdb/pull/1487 Differential Revision: D4157381 Pulled By: lightmark fbshipit-source-id: f079be8	2016-11-10 11:24:20 -08:00
Reid Horuff	1ca5f6d132	Fix 2PC Recovery SeqId Miscount Summary: Originally sequence ids were calculated, in recovery, based off of the first seqid found in the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs the sequence from the batch currently being recovered and increments it using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well as scenarios where some logs are missing or inserts skip the WAL. Closes https://github.com/facebook/rocksdb/pull/1486 Differential Revision: D4156064 Pulled By: reidHoruff fbshipit-source-id: a6da8d9	2016-11-10 11:09:22 -08:00
Andrew Kryczka	c90fef88b1	fix open failure with empty wal Summary: Closes https://github.com/facebook/rocksdb/pull/1490 Differential Revision: D4158821 Pulled By: IslamAbdelRahman fbshipit-source-id: 59b73f4	2016-11-09 22:24:26 -08:00
Andrew Kryczka	4e20c5da20	Store internal keys in TombstoneMap Summary: This fixes a correctness issue where ranges with same begin key would overwrite each other. This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator. End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix. Closes https://github.com/facebook/rocksdb/pull/1484 Differential Revision: D4155248 Pulled By: ajkr fbshipit-source-id: 304b4b9	2016-11-09 15:09:18 -08:00
Yueh-Hsuan Chiang	a9fb346e4a	Fix RocksDB Lite build failure in c_test.cc Summary: Fix the following RocksDB Lite build failure in c_test.cc db/c_test.c:1051:3: error: implicit declaration of function 'fprintf' is invalid in C99 [-Werror,-Wimplicit-function-declaration] fprintf(stderr, "SKIPPED\n"); ^ db/c_test.c:1051:3: error: declaration of built-in function 'fprintf' requires inclusion of the header <stdio.h> [-Werror,-Wbuiltin-requires-header] db/c_test.c:1051:11: error: use of undeclared identifier 'stderr' fprintf(stderr, "SKIPPED\n"); ^ 3 errors generated. Closes https://github.com/facebook/rocksdb/pull/1479 Differential Revision: D4151160 Pulled By: yhchiang fbshipit-source-id: a471a30	2016-11-09 12:24:18 -08:00
Reid Horuff	d133b08f68	Use correct sequence number when creating memtable Summary: copied from: `5ebfd2623a` Opening existing RocksDB attempts recovery from log files, which uses wrong sequence number to create the memtable. This is a regression introduced in change `a400336`. This change includes a test demonstrating the problem, without the fix the test fails with "Operation failed. Try again.: Transaction could not check for conflicts for operation at SequenceNumber 1 as the MemTable only contains changes newer than SequenceNumber 2. Increasing the value of the max_write_buffer_number_to_maintain option could reduce the frequency of this error" This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname and me. Closes https://github.com/facebook/rocksdb/pull/1458 Differential Revision: D4143791 Pulled By: reidHoruff fbshipit-source-id: 5a25033	2016-11-09 12:24:17 -08:00
Islam AbdelRahman	9bd191d2f4	Fix deadlock between (WriterThread/Compaction/IngestExternalFile) Summary: A deadlock is possible if the following happens: (1) the writer thread is stopped because it's waiting for compaction to finish, (2) compaction is waiting for current IngestExternalFile() calls to finish, (3) IngestExternalFile() is waiting to be able to acquire the writer thread, (4) the writer thread is held by stopped writes that are waiting for compactions to finish. This patch fixes the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread. This patch includes a unit test to reproduce the described scenario. Closes https://github.com/facebook/rocksdb/pull/1480 Differential Revision: D4151646 Pulled By: IslamAbdelRahman fbshipit-source-id: 09b39db	2016-11-09 10:54:10 -08:00
Islam AbdelRahman	193221e0a1	Fix Forward Iterator Seek()/SeekToFirst() Summary: In ForwardIterator::SeekInternal(), we may end up passing an empty Slice representing an internal key to InternalKeyComparator::Compare. When we try to extract the user key from this empty Slice, we will create a slice with size = 0 - 8 (which will overflow and cause us to read invalid memory as well). Scenarios to reproduce these issues are in the unit tests. Closes https://github.com/facebook/rocksdb/pull/1467 Differential Revision: D4136660 Pulled By: lightmark fbshipit-source-id: 151e128	2016-11-08 13:54:31 -08:00
Aaron Gao	e48f3f8b9e	remove tabs and duplicate #include in c api Summary: fix lint error about tabs and duplicate includes. Closes https://github.com/facebook/rocksdb/pull/1476 Differential Revision: D4149646 Pulled By: lightmark fbshipit-source-id: 2e0a632	2016-11-08 13:54:31 -08:00
Jay Lee	a7875272d7	c: support seek_for_prev Summary: support seek_for_prev in c abi. Closes https://github.com/facebook/rocksdb/pull/1457 Differential Revision: D4135360 Pulled By: lightmark fbshipit-source-id: 61256b0	2016-11-08 12:54:13 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
zhangjinpeng1987	879f366366	Add C api for RateLimiter Summary: Add C api for RateLimiter. Closes https://github.com/facebook/rocksdb/pull/1455 Differential Revision: D4116362 Pulled By: yiwu-arbug fbshipit-source-id: cb05a8d	2016-11-03 11:09:17 -07:00
Yi Wu	437942e481	Add avoid_flush_during_shutdown DB option Summary: Add avoid_flush_during_shutdown DB option. Closes https://github.com/facebook/rocksdb/pull/1451 Differential Revision: D4108643 Pulled By: yiwu-arbug fbshipit-source-id: abdaf4d	2016-11-02 15:39:18 -07:00
Benoit Girard	2b16d664cb	Change max_bytes_for_level_multiplier to double Summary: Closes https://github.com/facebook/rocksdb/pull/1427 Differential Revision: D4094732 Pulled By: yiwu-arbug fbshipit-source-id: b9b79e9	2016-11-01 21:09:23 -07:00
Jay Lee	16fb04434f	expose IngestExternalFile to c abi Summary: IngestExternalFile is very useful when doing bulk load. This PR exposes this API to C so many bindings can benefit from it too. Closes https://github.com/facebook/rocksdb/pull/1454 Differential Revision: D4113420 Pulled By: yiwu-arbug fbshipit-source-id: 307c6ae	2016-11-01 17:09:39 -07:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Vladislav Vaintroub	d5555d95a3	Fix MSVC compile error in 32 bit compilation Summary: Passing std::atomic<uint64_t> variables to ASSERT_EQ() results in compile error C2718 'const T1': actual parameter with requested alignment of 8 won't be aligned. VS2015 defines std::atomic as specially aligned type ( with 'alignas'), however the compiler does not like declspec(align)ed function arguments. Worked around by casting std::atomic<uint64_t> types to uint64_t in ASSERT_EQ. Closes https://github.com/facebook/rocksdb/pull/1450 Differential Revision: D4106788 Pulled By: yiwu-arbug fbshipit-source-id: 5fb42c3	2016-10-31 17:24:18 -07:00
Siying Dong	da61f348d3	Print compression and Fast CRC support info as Header level Summary: Currently the compression support and fast CRC support information is printed as info level. They should be in the same level as options, which is header level. Also add ZSTD to this printing. Closes https://github.com/facebook/rocksdb/pull/1448 Differential Revision: D4106608 Pulled By: yiwu-arbug fbshipit-source-id: cb9a076	2016-10-31 16:09:13 -07:00
Siying Dong	c90c48d3c8	Show More DB Stats in info logs Summary: DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line. Closes https://github.com/facebook/rocksdb/pull/1439 Differential Revision: D4100943 Pulled By: yiwu-arbug fbshipit-source-id: 79f7218	2016-10-29 16:09:18 -07:00
Siying Dong	1b295ac8ae	DBTest.GetThreadStatus: Wait for test results for longer Summary: The current 10 millisecond wait for test results may not be sufficient in some test environments. Increase it to 60 seconds and check the results every 1 millisecond. Already reviewed: https://reviews.facebook.net/D65457 Closes https://github.com/facebook/rocksdb/pull/1437 Differential Revision: D4099443 Pulled By: siying fbshipit-source-id: cf1f205	2016-10-29 16:09:18 -07:00
Aaron Gao	b50a81a2bb	Add a test for tailing_iterator Summary: A bug that tailingIterator->Seek(target) skips records. I think the bug is in the SeekInternal starting at lines 387: search_left_bound > search_right_bound There are only 2 cases this can happen: (1) target key is smaller than left most file (2) target key is larger than right most file The comment is wrong, there is another possibility that at the higher level there is a big gap such that the file in the lower level fits completely in the gap and then indexer->GetNextLevelIndex returns search_left_bound > search_right_bound I think pointing on the files after and before the gap. details: https://github.com/facebook/rocksdb/issues/1372 fixed this bug with test case added. Closes https://github.com/facebook/rocksdb/pull/1436 Reviewed By: IslamAbdelRahman Differential Revision: D4099313 Pulled By: lightmark fbshipit-source-id: 6a675b3	2016-10-28 18:24:14 -07:00
Siying Dong	04751d5345	L0 compression should follow options.compression_per_level if not empty Summary: Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0]. Reviewed and accepted in: https://reviews.facebook.net/D65607 Closes https://github.com/facebook/rocksdb/pull/1435 Differential Revision: D4099368 Pulled By: siying fbshipit-source-id: cfbbdcd	2016-10-28 17:39:20 -07:00
Andrew Kryczka	2946cadc46	Improve RangeDelAggregator documentation Summary: as requested in D62259 Closes https://github.com/facebook/rocksdb/pull/1434 Differential Revision: D4099047 Pulled By: ajkr fbshipit-source-id: a258cfb	2016-10-28 15:54:21 -07:00
Aaron Gao	bc429de490	revert fractional cascading in forward iterator Summary: As discussed offline with Siying, revert this since it has a bug with seek. Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65559	2016-10-28 10:25:39 -07:00
Andrew Kryczka	b9bc7a2aa4	Use skiplist rep for range tombstone memtable Summary: somehow missed committing this update in D62217 Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65361	2016-10-27 10:07:28 -07:00
Siying Dong	9ee84067f6	Disable DBTest.RepeatedWritesToSameKey (#1420 ) Summary: The verification condition of the test DBTest.RepeatedWritesToSameKey doesn't hold anymore after `3ce3bb3da2`. Disable the test for now before we find a way to replace it. Test Plan: Run the test and make sure it is disabled.	2016-10-25 10:23:50 -07:00
Aaron Gao	9de2f75216	revert Prev() in MergingIterator to use previous code in non-prefix-seek mode Summary: Siying suggested to keep old code for normal mode prev() for safety Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65439	2016-10-24 13:13:01 -07:00
sdong	24495186da	DBSSTTest.RateLimitedDelete: not to use real clock Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead. Test Plan: Run the tests and all existing tests. Reviewers: yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65145	2016-10-24 10:35:00 -07:00
sdong	1168cb810a	Fix a bug that may cause a deleted row to appear again Summary: The previous fix of reappearing of a deleted row `0ce258f9b3` missed a corner case, which can be reproduced using test CompactionPickerTest.OverlappingUserKeys7. Consider such an example: input level file: 1[B E] 2[F H] output level file: 3[A C] 4[D I] 5[I K] First file 2 is picked, which overlaps to file 4. 4 expands to 5. Now the all range is [D K] with 2 output level files. When we try to expand that, [D K] overlaps with file 1 and 2 in the input level, and 1 and 2 overlaps with 3 and 4 in the output level. So we end up with picking 3 and 4 in the output level. Without expanding, it also has 2 files, so we determine the output level doesn't change, although they are the different two files. The fix is to expand the output level files after we picked 3 and 4. In that case, there will be three output level files so we will abort the expanding. I also added two unit tests related to marked_for_compaction and being_compacted. They have been passing though. Test Plan: Run the new unit test, as well as all other tests. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65373	2016-10-24 09:49:07 -07:00
Edouard A	99c052a34f	Fix integer overflow in GetL0ThresholdSpeedupCompaction (#1378 )	2016-10-23 18:43:29 -07:00
Aaron Gao	59a7c0337b	Change ioptions to store user_comparator, fix bug Summary: change ioptions.comparator to user_comparator instead of internal_comparator. Also change Comparator* to InternalKeyComparator* to make its type explicit. Test Plan: make all check -j64 Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65121	2016-10-21 11:31:42 -07:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
sdong	1d9dbef64e	Restrict running condition of UniversalCompactionTrivialMoveTest2 Summary: DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 verifies non-trivial move is not triggered if we load data in sequential order. However, if there are multiple compaction threads, this condition may not hold. Restrict the running condition to 1 compaction thread to make the test more robust. Test Plan: Run the test and make sure at least it doesn't regress normally. Reviewers: yhchiang, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65277	2016-10-20 15:43:00 -07:00
sdong	fb2e412943	column_family_test: disable some tests in LITE Summary: Some tests in column_family_test depend on functions that are not available in LITE build, which sometimes cause flakiness. Disable them. Test Plan: Run those tests in LITE build. Reviewers: yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65271	2016-10-19 15:55:56 -07:00
Aaron Gao	5af651db24	fix data race in compact_files_test Summary: fix data race Test Plan: compact_files_test Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65259	2016-10-19 13:37:51 -07:00
Andrew Kryczka	a0ba0aa877	Fix uninitialized variable gcc error for MyRocks Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails. Test Plan: run myrocks release tests Reviewers: lightmark, mung, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65199	2016-10-19 10:59:46 -07:00
Islam AbdelRahman	b88f8e87c5	Support SST files with Global sequence numbers [reland] Summary: reland https://reviews.facebook.net/D62523 - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: sdong, yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65211	2016-10-18 16:59:37 -07:00
Aaron Gao	52c9808c3a	not split file in compaction on level 0 Summary: we should not split files on level 0 in compaction because it will fail the subsequent verification of seqno order on level 0 Test Plan: check with filldeterministic in db_bench Reviewers: yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65193	2016-10-18 16:30:34 -07:00
Aaron Gao	5e0d6b4cc9	fix db_stress assertion failure Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted Test Plan: db_iter_test Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65187	2016-10-18 16:07:10 -07:00
sdong	b4d07123c4	SamePrefixTest.InDomainTest to clear the test directory before testing Summary: SamePrefixTest.InDomainTest may fail if the previous run of some test cases in prefix_test fail. Test Plan: Run the test Reviewers: lightmark, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65163	2016-10-18 14:01:10 -07:00
Islam AbdelRahman	aa09d03381	Avoid calling GetDBOptions() inside GetFromBatchAndDB() Summary: MyRocks hit a regression, @mung generated perf reports showing that the reason is the cost of calling `GetDBOptions()` inside `GetFromBatchAndDB()` This diff avoid calling `GetDBOptions` and use the `ImmutableDBOptions` instead Test Plan: make check -j64 Reviewers: sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, mung Differential Revision: https://reviews.facebook.net/D65151	2016-10-18 13:19:26 -07:00
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. 
Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	2016-10-18 12:04:56 -07:00
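Note on the range-deletion entry above: the coverage check it describes (a linear scan over the tombstones of one snapshot stripe) can be sketched roughly as follows. This is a simplified illustration, not the RocksDB implementation. `RangeTombstoneSketch` and `ShouldDeleteSketch` are hypothetical names, and the sketch assumes a single stripe, bytewise key order, an exclusive end key, and that a tombstone only covers keys with a smaller sequence number.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified tombstone: covers user keys in [begin, end)
// that were written before sequence number `seq`.
struct RangeTombstoneSketch {
  std::string begin;
  std::string end;
  uint64_t seq;
};

// Linear scan over the tombstones of one (assumed) snapshot stripe.
// Returns true if any tombstone covers the key at the given sequence number.
bool ShouldDeleteSketch(const std::vector<RangeTombstoneSketch>& tombstones,
                        const std::string& user_key, uint64_t key_seq) {
  for (const auto& t : tombstones) {
    const bool in_range = t.begin <= user_key && user_key < t.end;
    if (in_range && key_seq < t.seq) {
      return true;
    }
  }
  return false;
}
```

With one tombstone over [b, e) at sequence 10, key c at sequence 5 is covered, while key e (the end is exclusive here) and key c at sequence 12 are not.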
Andrew Kryczka	f47054015d	Handle WAL deletion when using avoid_flush_during_recovery Summary: Previously the WAL files that were avoided during recovery would never be considered for deletion. That was because alive_log_files_ was only populated when log files are created. This diff further populates alive_log_files_ with existing log files that aren't flushed during recovery, such that FindObsoleteFiles() can find them later. Depends on D64053. Test Plan: new unit test, verifies it fails before this change and passes after Reviewers: sdong, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64059	2016-10-14 12:59:51 -07:00
Yi Wu	e29d3b67c2	Make max_background_compactions and base_background_compactions dynamic changeable Summary: Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions. I'll add more dynamic changeable options soon. Test Plan: unit test. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64749	2016-10-14 12:25:39 -07:00
Aaron Gao	21e8daced5	fix assertion failure in Prev() Summary: fix assertion failure in db_stress. It happens because of prefix seek key is larger than merge iterator key when they have the same user key Test Plan: ./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=0 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=0 --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=4 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=888887 --cache_size=1048576 --verify_checksum=1 Reviewers: sdong, andrewkr, yiwu, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65025	2016-10-13 17:36:48 -07:00
Yueh-Hsuan Chiang	040328a30d	Remove an assertion for single-delete in MergeHelper::MergeUntil Summary: Previously we had an assertion which triggered when we issued Merges after a single delete. However, merges after a single delete are unrelated to that single delete. Thus this behavior should be allowed. This will address a flakiness of db_stress. Test Plan: db_stress Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64923	2016-10-13 14:26:57 -07:00
Islam AbdelRahman	f26a139d89	Log successful AddFile Summary: Log successful AddFile Test Plan: visually check LOG file Reviewers: yiwu, andrewkr, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D65019	2016-10-13 11:56:27 -07:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Andrew Kryczka	017de666c7	fixup commit Summary: I accidentally left out these changes from my commit of D64053 due to messing up the merge conflict resolution. Test Plan: ./db_wal_test Reviewers: Subscribers: Tasks: Blame Revision: D64053	2016-10-13 08:48:40 -07:00
Andrew Kryczka	1b7af5fb1a	Redo handling of recycled logs in full purge Summary: This reverts commit `9e4aa798c3`, which doesn't handle all cases (see inline comment). I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313. This approach has two benefits: - All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles. - We only need to check whether log file is recycled in one place where we've already determined it's a log file Test Plan: new unit test, verified fails before the original fix, still passes now. Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: yiwu, sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64053	2016-10-12 23:13:09 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Islam AbdelRahman	2ad68b971a	Support running consistency checks in release mode Summary: We always run consistency checks when compiling in debug mode. Allow users to set Options::force_consistency_checks to true to run such checks even when compiling in release mode. Test Plan: make check -j64 make release Reviewers: lightmark, sdong, yiwu Reviewed By: yiwu Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64701	2016-10-07 17:21:45 -07:00
Islam AbdelRahman	d062328977	Revert "Support SST files with Global sequence numbers" This reverts commit `ab01da5437`.	2016-10-07 14:05:12 -07:00
Reid Horuff	2c1f95291d	Add facility to write only a portion of WriteBatch to WAL Summary: When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. Batch operations after this call will not be written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement. ``` RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628 RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054 ``` Test Plan: Two unit tests added. Reviewers: sdong, yiwu, IslamAbdelRahman Reviewed By: yiwu Subscribers: hermanlee4, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64599	2016-10-07 11:32:10 -07:00
Islam AbdelRahman	ab01da5437	Support SST files with Global sequence numbers Summary: - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: andrewkr, yhchiang, yiwu, sdong Reviewed By: sdong Subscribers: hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62523	2016-10-03 16:12:39 -07:00
Andrew Kryczka	6009c473c7	Store range tombstones in memtable Summary: - Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory - MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep - Part of the read path is not implemented yet (i.e., MemTable::Get()) Test Plan: see unit tests Reviewers: wanning Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62217	2016-09-30 09:06:43 -07:00
Aaron Gao	26388247aa	delete unused variable for PrevInterval() Summary: delete unused variable Test Plan: make check Reviewers: sdong, andrewkr, IslamAbdelRahman, tianx Reviewed By: tianx Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64509	2016-09-29 13:19:58 -07:00
Islam AbdelRahman	87dfc1d23e	Fix conflict between AddFile() and CompactRange() Summary: Fix the conflict bug between AddFile() and CompactRange() by - Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction - If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it This will solve these 2 scenarios - If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction - If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64449	2016-09-28 15:42:06 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
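Note on the `SeekForPrev` entry above: the stated contract, "find the last key that <= target key", can be illustrated on an ordinary ordered map. This is a minimal sketch of the semantics only. `SeekForPrevSketch` is a hypothetical helper over `std::map`, not the RocksDB iterator API, and it ignores prefix extractors and upper bounds.

```cpp
#include <iterator>
#include <map>
#include <string>

// Hypothetical helper: returns an iterator to the last entry whose key is
// <= target, or m.end() if every key in the map is greater than target.
template <typename Map>
typename Map::const_iterator SeekForPrevSketch(
    const Map& m, const typename Map::key_type& target) {
  auto it = m.upper_bound(target);  // first entry with key > target
  if (it == m.begin()) {
    return m.end();  // no key <= target exists
  }
  return std::prev(it);  // step back to the last key <= target
}
```

For a map holding keys a, b and d, a target of c positions on b, a target of d positions on d itself, and a target smaller than a yields the end iterator.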
yiwu-arbug	eb3894cf42	Recompute compaction score on SetOptions (#1346 ) Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it. Test Plan: See unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64167	2016-09-27 11:17:15 -07:00
Islam AbdelRahman	5c64fb67d2	Fix AddFile() conflict with compaction output [WaitForAddFile()] Summary: Since AddFile unlocks and relocks the mutex inside LogAndApply(), we need to ensure that other compactions cannot run during this period, since such compactions are not aware of the file we are ingesting and could create a compaction that overlaps with this file. This diff adds - WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now - Call `WaitForAddFile()` in 3 locations -- When doing manual Compaction -- When starting automatic Compaction -- When doing CompactFiles() Test Plan: unit test Reviewers: lightmark, yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64383	2016-09-27 00:14:55 -07:00
Islam AbdelRahman	9e9f5a0b92	Fix CompactFilesTest.ObsoleteFiles timeout (#1353 )	2016-09-26 10:39:07 -07:00
Aaron Gao	c2a62a4cb2	not cut compaction output when compact to level 0 Summary: we should not call ShouldStopBefore() in compaction when the compaction targets level 0. Otherwise, CheckConsistency will fail the assertion of seq number check on level 0. Test Plan: make all check -j64 I also manually tested this using db_bench to compact files to level 0. Without this line change, the assertion fails and multiple files are generated on level 0 after compaction. Reviewers: yhchiang, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64269	2016-09-23 17:16:38 -07:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
yiwu-arbug	4bc8c88e6b	Recover same sequence id from WAL (#1350 ) Summary: Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behavior for 2PC for now but will fix later. This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id. Test Plan: Added unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64275	2016-09-23 16:15:14 -07:00
Aaron Gao	0a1bd9c509	add cfh deletion started listener Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler. Test Plan: ./listener_test Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60717	2016-09-22 11:56:18 -07:00
Islam AbdelRahman	da5a9a65c1	Fix mac build	2016-09-21 20:22:09 -07:00
Islam AbdelRahman	abc0ae462b	Add AddFile() InternalStats for Total files/L0 files/total keys ingested Summary: Report more information about the ingested files in CF InternalStats - Total files - Total L0 files - Total keys It was also noticed that we were reporting files that failed to ingest; this diff fixes that bug Test Plan: print stats in tests Reviewers: sdong, andrewkr, lightmark Reviewed By: lightmark Subscribers: jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63039	2016-09-21 14:24:08 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
Yi Wu	e4d3f5d9b8	Fix DBImpl::GetWalPreallocateBlockSize Mac build error Summary: Specify type param with std::min to resolve compile error on Mac. Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845 Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64143	2016-09-20 10:17:28 -07:00
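Note on the `std::min` entry above: `std::min(a, b)` deduces one template type from both arguments, so it fails to compile when they are distinct integer types, as `uint64_t` and `size_t` are on some platforms. Naming the type parameter explicitly avoids the deduction. The function below is a hypothetical illustration of that fix, not the actual `DBImpl::GetWalPreallocateBlockSize` code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical function: the two parameters deliberately use different
// integer typedefs to mirror the mixed-type situation.
uint64_t SmallerOfSketch(uint64_t limit, size_t requested) {
  // Writing std::min(limit, requested) relies on both types being the same;
  // where uint64_t and size_t are distinct types, deduction is ambiguous.
  // An explicit type parameter converts both arguments to uint64_t.
  return std::min<uint64_t>(limit, requested);
}
```

The explicit parameter compiles the same way on every platform, at the cost of an implicit conversion of the `size_t` argument.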
sdong	d78a4401b5	DBImpl::GetWalPreallocateBlockSize() should return size_t Summary: WritableFile::SetPreallocationBlockSize() takes a size_t parameter, and the options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. DBImpl::GetWalPreallocateBlockSize() should return size_t to avoid a build break if size_t is not uint64_t. Test Plan: Run existing tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64137	2016-09-19 16:51:38 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
rockeet	4c3f4496b5	Add TableBuilderOptions::level and relevant changes (#1335 )	2016-09-17 22:30:43 -07:00
Yi Wu	0a88f38b7e	Remove ColumnFamilyData::options() Summary: One more small refactor before I split DBOptions into mutable and immutable parts. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64047	2016-09-16 15:09:14 -07:00
Yi Wu	8d9bf5c498	Fix DBOptionsTest.GetLatestOptions Summary: RandomInitCFOptions will allocate a new compaction filter, which we have to delete afterward. Test Plan: valgrind against the test Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64011	2016-09-15 14:57:32 -07:00
Yi Wu	40cfa3e021	Fix DBWALTest.RecoveryWithLogDataForSomeCFs with mac Summary: Seems there's no std::array on mac+clang. Use raw array instead. Test Plan: run ./db_wal_test on mac. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64005	2016-09-15 13:44:33 -07:00
Andrew Kryczka	06b4785fec	Fix recovery for WALs without data for all CFs Summary: if one or more CFs had no data in the WAL, the log number that's used by FindObsoleteFiles() wasn't updated. We need to treat this case the same as if the data for that WAL had been flushed. Test Plan: new unit test Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63963	2016-09-15 11:40:48 -07:00
Andrew Kryczka	d7242ff4d5	Fix GetSortedWalFiles when log recycling enabled Summary: Previously the sequence number was mistakenly passed in an argument where the log number should go. This caused the reader to assume the old WAL format was used, which is incompatible with the WAL recycling format. Test Plan: new unit test, verified it fails before this change and passes afterwards. Reviewers: yiwu, lightmark, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63987	2016-09-15 09:55:02 -07:00
Yi Wu	17f76fc564	DB::GetOptions() reflect dynamic changed options Summary: DB::GetOptions() reflect dynamic changed options. Test Plan: See the new unit test. Reviewers: yhchiang, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63903	2016-09-14 22:10:28 -07:00
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depend only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	2016-09-13 21:11:59 -07:00
somnathr	9e4aa798c3	Summary: (#1313 ) If log recycling is enabled in RocksDB (recycle_log_file_num=16), db->Writebatch errors out with keynotfound after ~5-6 hours of running (1M seq, but I guess it can happen with any workload). See my detailed bug report here (https://github.com/facebook/rocksdb/issues/1303). This commit is the fix: a check has been added so that the log file is not deleted if it is already in the recycle list. Test Plan: Unit tested it and ran a similar profile. Not reproducing anymore.	2016-09-12 16:53:42 -07:00
Adam Faulkner	a10e8a056d	Fix C api memtable rep bugs. (#1328 )	2016-09-12 15:31:42 -07:00
zhangjinpeng1987	b06b191362	add C api for set wal_recovery_mode (#1327 ) * add C api for set wal recovery mode * add test	2016-09-09 10:11:30 -07:00
Islam AbdelRahman	1cca091298	Temporarily revert Prev() prefix support Summary: Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release These are the commits reverted - `6a14d55bd9` - `b18f9c9eac` - `db74b1a219` - `2482d5fb45` Test Plan: make check -j64 Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63789	2016-09-08 14:45:32 -07:00
Islam AbdelRahman	52ee07b021	Move AddFile() tests to external_sst_file_test.cc Summary: Simply move the tests Test Plan: make check -j64 Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62529	2016-09-07 15:41:54 -07:00
Edouard A	66a91e2607	Add NoSpace subcode to IOError (#1320 ) Add a sub code to distinguish "out of space" errors from regular I/O errors	2016-09-07 12:37:45 -07:00
sdong	67036c0406	Fix Flaky ColumnFamilyTest.FlushCloseWALFiles Summary: In ColumnFamilyTest.FlushCloseWALFiles, there is a small window in which the flush has finished but the log writer is not yet closed, causing the assert failure. Fix it by explicitly waiting the flush job to finish. Test Plan: Run the test many times in high parallelism. Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63423	2016-09-07 11:19:15 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_impl.cc fix casting	2016-09-06 08:41:43 -07:00
sdong	22696b0881	Fix uninitialized CompactionJob::SubcompactionState::current_output_file_size Summary: The new variable introduced in 2149059f910149197d1a0f79ac08cf19465ea2d may be uninitialized. Valgrind is failing because of it. Test Plan: Run valgrind tests Reviewers: yiwu, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63201	2016-09-02 17:06:20 -07:00
Yi Wu	a88677d2cf	Remove ImmutableCFOptions from public API Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (`5665e5e285`) but later its reference in /include/rocksdb/table.h is removed. Test Plan: make all check Reviewers: IslamAbdelRahman, sdong, yhchiang Reviewed By: yhchiang Subscribers: yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63177	2016-09-02 14:16:31 -07:00
Islam AbdelRahman	80c75593ed	Fix data race in AddFile() with multiple files + custom comparator bug Summary: When ingesting multiple files - We should use user comparator - Should not call `cfd->current()` outside of mutex Test Plan: unit tests Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63075	2016-09-02 11:17:58 -07:00
John Alexander	4fd08f4b8b	Ensure Correct Behavior of StatsLevel kExceptDetailedTimers and kExceptTimeForMutex (#1308 ) * Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing). * Changed order of StatsLevel enum to simplify logic for determining what stats to record.	2016-09-01 19:57:55 -07:00
sdong	32149059f9	Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes. Test Plan: Add two new unit tests. Run all existing tests, including jtest. Reviewers: yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59829	2016-09-01 14:33:24 -07:00
Aaron Gao	6a14d55bd9	add prefix_seek_mode to db_iter_test Summary: add prefix_seek_mode to db_iter_test to enable data race test for iterator when prefix_extractor != nullptr Test Plan: make all check -j64 Reviewers: andrewkr, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63027	2016-08-31 12:07:09 -07:00
Aaron Gao	2482d5fb45	support Prev() in prefix seek mode Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case) Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61419	2016-08-29 20:55:39 -07:00
Islam AbdelRahman	b49b92cf28	Introduce Read amplification bitmap (read amp statistics) Summary: Add ReadOptions::read_amp_bytes_per_bit option, which allows us to create a bitmap for every data block we read; the bitmap will contain (block_size / read_amp_bytes_per_bit) bits. We will use this bitmap to mark which bytes of the block have been used so we can calculate the read amplification. Test Plan: added new tests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: yiwu, leveldb, march, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58707	2016-08-26 18:55:58 -07:00
sdong	dade61ac26	Mitigate regression bug of options.max_successive_merges hit during DB Recovery Summary: After `1b8a2e8fdd`, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition. This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed. Test Plan: Add a new test and run it. Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff Reviewed By: kradhakrishnan Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62625	2016-08-25 17:30:34 -07:00
Justin Gibbs	b2ce59537c	Persist data during user initiated shutdown Summary: Move the manual memtable flush for databases containing data that has bypassed the WAL from DBImpl's destructor to CancelAllBackgroundWork(). CancelAllBackgroundWork() is a publicly exposed API which allows async operations performed by background threads to be disabled on a database. In effect, this places the database into a "shutdown" state in advance of calling the database object's destructor. No compactions or flushing of SST files can occur once a call to this API completes. When writes are issued to a database with WriteOptions::disableWAL set to true, DBImpl::has_unpersisted_data_ is set so that memtables can be flushed when the database object is destroyed. If CancelAllBackgroundWork() has been called prior to DBImpl's destructor, this flush operation is not possible and is skipped, causing unnecessary loss of data. Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor in order to perform the thread join portion of its cleanup processing, moving the manual memtable flush to CancelAllBackgroundWork() ensures data is persisted regardless of client behavior. Test Plan: Write an amount of data that will not cause a memtable flush to a rocksdb database with all writes marked with WriteOptions::disableWAL. Properly "close" the database. Reopen database and verify that the data was persisted. Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62277	2016-08-25 12:24:22 -07:00
Mike Kolupaev	7b81095171	Fix a crash when compaction fails to open a file Summary: We've got a crash with this stack trace: Program terminated with signal SIGTRAP, Trace/breakpoint trap. #0 0x00007fc85f2f4009 in raise () from /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0 #1 0x00000000005c8f61 in facebook::logdevice::handle_sigsegv(int) () at logdevice/server/sigsegv.cpp:159 #2 0x00007fc85f2f4150 in <signal handler called> () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0 #3 0x00000000031ed80c in rocksdb::NewReadaheadRandomAccessFile() at util/file_reader_writer.cc:383 #4 0x00000000031ed80c in rocksdb::NewReadaheadRandomAccessFile() at util/file_reader_writer.cc:472 #5 0x00000000031558e7 in rocksdb::TableCache::GetTableReader() at db/table_cache.cc:99 #6 0x0000000003156329 in rocksdb::TableCache::NewIterator() at db/table_cache.cc:198 #7 0x0000000003166568 in rocksdb::VersionSet::MakeInputIterator() at db/version_set.cc:3345 #8 0x000000000324a94f in rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState) () at db/compaction_job.cc:650 #9 0x000000000324c2f6 in rocksdb::CompactionJob::Run() () at db/compaction_job.cc:530 #10 0x00000000030f5ae5 in rocksdb::DBImpl::BackgroundCompaction() at db/db_impl.cc:3269 #11 0x0000000003108d36 in rocksdb::DBImpl::BackgroundCallCompaction(void) () at db/db_impl.cc:2970 #12 0x00000000029a2a9a in facebook::logdevice::RocksDBEnv::callback(void) () at logdevice/server/locallogstore/RocksDBEnv.cpp:26 #13 0x00000000029a2a9a in facebook::logdevice::RocksDBEnv::callback(void) () at logdevice/server/locallogstore/RocksDBEnv.cpp:30 #14 0x00000000031e7521 in rocksdb::ThreadPool::BGThread() at util/threadpool.cc:230 #15 0x00000000031e7663 in rocksdb::BGThreadWrapper(void*) () at util/threadpool.cc:254 #16 0x00007fc85f2ea7f1 in start_thread () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libpthread.so.0 #17 0x00007fc85e8fb46d in clone () at /usr/local/fbcode/gcc-4.9-glibc-2.20-fb/lib/libc.so.6 From looking at the code, probably what happened is this: - `TableCache::GetTableReader()` called `Env::NewRandomAccessFile()`, which dispatched to a `PosixEnv::NewRandomAccessFile()`, where probably an `open()` call failed, so the `NewRandomAccessFile()` left a nullptr in the resulting file, - `TableCache::GetTableReader()` called `NewReadaheadRandomAccessFile()` with that `nullptr` file, - it tried to call file's method and crashed. This diff is a trivial fix to this crash. Test Plan: `make -j check` Reviewers: sdong, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62451	2016-08-25 04:39:26 -07:00
Islam AbdelRahman	2a9c97108e	[Flaky Test] Disable DBPropertiesTest.GetProperty Summary: Disable flaky test Test Plan: run it Reviewers: yiwu, andrewkr, kradhakrishnan, yhchiang, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62487	2016-08-24 15:32:01 -07:00
Yi Wu	badbff65b7	Not insert into block cache if cache is full and not holding handle Summary: We used to allow inserts into a full block cache as long as `strict_capacity_limit=false`. This diff further restricts inserts into a full cache if the caller doesn't intend to hold a handle to the cache entry after insert. Hopefully this diff fixes the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014 db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed. The assertion at lru_cache.cc:278 can fail when an entry is inserted into a full cache and stays in the LRU list. Test Plan: make all check Reviewers: IslamAbdelRahman, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62325	2016-08-23 13:53:49 -07:00
Yi Wu	4a16c32ece	Option to cache index/filter blocks with priority Summary: Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserve space for index/filter blocks, making them less likely to be evicted. Depends on D61977. Test Plan: See unit test. Reviewers: lightmark, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, march, leveldb Differential Revision: https://reviews.facebook.net/D62241	2016-08-23 13:44:13 -07:00
Islam AbdelRahman	6a17b07ca8	Add TablePropertiesCollector support in SstFileWriter Summary: Update SstFileWriter to use user TablePropertiesCollectors that are passed in Options Test Plan: unittests Reviewers: sdong Reviewed By: sdong Subscribers: jkedgar, andrewkr, hermanlee4, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D62253	2016-08-19 16:17:56 -07:00
Wanning Jiang	78837f5d61	TableBuilder / TableReader support for range deletion Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuilder. Also test writing/reading range deletion tombstones in table_test ) Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61473	2016-08-19 15:10:31 -07:00

2656 Commits