rocksdb

Author	SHA1	Message	Date
Artemiy Kolesnikov	2f4fc539c6	Compaction::IsTrivialMove relaxing Summary: IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes. Closes https://github.com/facebook/rocksdb/pull/1619 Differential Revision: D4278338 Pulled By: yiwu-arbug fbshipit-source-id: 994c001	2016-12-07 11:54:11 -08:00
dhruba borthakur	1dce75b2e2	Update USERS.md Summary: Closes https://github.com/facebook/rocksdb/pull/1630 Differential Revision: D4293200 Pulled By: IslamAbdelRahman fbshipit-source-id: 9c81f85	2016-12-07 11:24:14 -08:00
dhruba borthakur	304b3c706b	Update USERS.md Summary: Closes https://github.com/facebook/rocksdb/pull/1629 Differential Revision: D4293183 Pulled By: IslamAbdelRahman fbshipit-source-id: 759ea92	2016-12-07 11:24:14 -08:00
Andrew Kryczka	fa50fffaf5	Option to expand range tombstones in db_bench Summary: When enabled, this option replaces range tombstones with a sequence of point tombstones covering the same range. This can be used to A/B test perf of range tombstones vs sequential point tombstones, and help us find the cross-over point, i.e., the size of the range above which range tombstones outperform point tombstones. Closes https://github.com/facebook/rocksdb/pull/1594 Differential Revision: D4246312 Pulled By: ajkr fbshipit-source-id: 3b00b23	2016-12-06 18:09:14 -08:00
Yi Wu	c26a4d8e8a	Fix compile error in trasaction_lock_mgr.cc Summary: Fix error on mac/windows build since they don't recognize `uint`. Closes https://github.com/facebook/rocksdb/pull/1624 Differential Revision: D4287139 Pulled By: yiwu-arbug fbshipit-source-id: b7cc88f	2016-12-06 14:39:16 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Manuel Ung	2005c88a75	Implement non-exclusive locks Summary: This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not. Some implementation notes: - Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`. - Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit. - Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now. Closes https://github.com/facebook/rocksdb/pull/1573 Differential Revision: D4239097 Pulled By: lth fbshipit-source-id: da7c074	2016-12-05 17:39:17 -08:00
Yi Wu	0b0f235724	Mention IngestExternalFile changes in HISTORY.md Summary: I hit the land button too fast and didn't include the line. Closes https://github.com/facebook/rocksdb/pull/1622 Differential Revision: D4281316 Pulled By: yiwu-arbug fbshipit-source-id: c7b38e0	2016-12-05 16:09:11 -08:00
Yi Wu	23db48e8d8	Update HISTORY.md for 5.0 branch Summary: These changes are included in the new branch-cut. Closes https://github.com/facebook/rocksdb/pull/1621 Differential Revision: D4281015 Pulled By: yiwu-arbug fbshipit-source-id: d88858b	2016-12-05 15:39:11 -08:00
Mike Kolupaev	beb36d9c1e	Fixed CompactionFilter::Decision::kRemoveAndSkipUntil Summary: Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure: db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't. Closes https://github.com/facebook/rocksdb/pull/1611 Differential Revision: D4265622 Pulled By: al13n321 fbshipit-source-id: 747b10f	2016-12-05 15:24:11 -08:00
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	2016-12-05 14:24:16 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Islam AbdelRahman	edde954e7b	fix clang build Summary: override is missing for FilterV2 Closes https://github.com/facebook/rocksdb/pull/1606 Differential Revision: D4263832 Pulled By: IslamAbdelRahman fbshipit-source-id: d8b337a	2016-12-01 18:39:10 -08:00
Yi Wu	56281f3a97	Add memtable_insert_with_hint_prefix_size option to db_bench Summary: Add memtable_insert_with_hint_prefix_size option to db_bench Closes https://github.com/facebook/rocksdb/pull/1604 Differential Revision: D4260549 Pulled By: yiwu-arbug fbshipit-source-id: cee5ef7	2016-12-01 16:54:16 -08:00
Islam AbdelRahman	4a21b1402c	Cache heap::downheap() root comparison (optimize heap cmp call) Summary: Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child) So next time we can compare directly against the smallest child I see that the total number of calls to comparator drops significantly when using this optimization Before caching (~2mil key comparison for iterating the DB) ``` $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2 readseq : 0.338 micros/op 2959201 ops/sec; 327.4 MB/s user_key_comparison_count = 2000008 ``` After caching (~1mil key comparison for iterating the DB) ``` $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2 readseq : 0.309 micros/op 3236801 ops/sec; 358.1 MB/s user_key_comparison_count = 1000011 ``` It also improves Closes https://github.com/facebook/rocksdb/pull/1600 Differential Revision: D4256027 Pulled By: IslamAbdelRahman fbshipit-source-id: 76fcc66	2016-12-01 13:39:14 -08:00
Islam AbdelRahman	e39d080871	Fix travis (compile for clang < 3.9) Summary: Travis fail because it uses clang 3.6 which don't recognize `__attribute__((__no_sanitize__("undefined")))` Closes https://github.com/facebook/rocksdb/pull/1601 Differential Revision: D4257175 Pulled By: IslamAbdelRahman fbshipit-source-id: fb4d1ab	2016-12-01 10:09:22 -08:00
Igor Canadi	3f407b065c	Kill flashcache code in RocksDB Summary: Now that we have userspace persisted cache, we don't need flashcache anymore. Closes https://github.com/facebook/rocksdb/pull/1588 Differential Revision: D4245114 Pulled By: igorcanadi fbshipit-source-id: e2c1c72	2016-12-01 10:09:22 -08:00
fangchenliaohui	b77007df8b	Bug: paralle_group status updated in WriteThread::CompleteParallelWorker Summary: Multi-write thread may update the status of the parallel_group in WriteThread::CompleteParallelWorker if the status of Writer is not ok! When copy write status to the paralle_group, the write thread just hold the mutex of the the writer processed by itself. it is useless. The thread should held the the leader of the parallel_group instead. Closes https://github.com/facebook/rocksdb/pull/1598 Differential Revision: D4252335 Pulled By: siying fbshipit-source-id: 3864cf7	2016-12-01 09:54:11 -08:00
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	2016-11-30 11:24:14 -08:00
zhangjinpeng1987	00197cff39	Add C API to set base_backgroud_compactions Summary: Add C API to set base_backgroud_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	2016-11-30 11:09:13 -08:00
Andrew Kryczka	5b219eccb5	deleterange end-to-end test improvements for lite/robustness Summary: Closes https://github.com/facebook/rocksdb/pull/1591 Differential Revision: D4246019 Pulled By: ajkr fbshipit-source-id: 0c4aa37	2016-11-29 12:24:13 -08:00
Anirban Rahut	aad1191765	pass rocksdb oncall to mysql_mtr_filter otherwise tasks get created w… Summary: …rong owner mysql_mtr_filter script needs proper oncall Closes https://github.com/facebook/rocksdb/pull/1586 Differential Revision: D4245150 Pulled By: anirbanr-fb fbshipit-source-id: fd8577c	2016-11-29 12:09:12 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Siying Dong	7784980fcd	Fix mis-reporting of compaction read bytes to the base level Summary: In dynamic leveled compaction, when calculating read bytes, output level bytes may be wronglyl calculated as input level inputs. Fix it. Closes https://github.com/facebook/rocksdb/pull/1475 Differential Revision: D4148412 Pulled By: siying fbshipit-source-id: f2f475a	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	3c6b49ed66	Fix implicit conversion between int64_t to int Summary: Make conversion explicit, implicit conversion breaks the build Closes https://github.com/facebook/rocksdb/pull/1589 Differential Revision: D4245158 Pulled By: IslamAbdelRahman fbshipit-source-id: aaec00d	2016-11-29 10:54:15 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Andrew Kryczka	4f6e89b1d0	Fix range deletion covering key in same SST file Summary: AddTombstones() needs to be before t->Get(), oops :'( Closes https://github.com/facebook/rocksdb/pull/1576 Differential Revision: D4241041 Pulled By: ajkr fbshipit-source-id: 781ceea	2016-11-28 22:54:13 -08:00
Islam AbdelRahman	a2bf265a39	Avoid intentional overflow in GetL0ThresholdSpeedupCompaction Summary: `99c052a34f` fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve. UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t Closes https://github.com/facebook/rocksdb/pull/1582 Differential Revision: D4241525 Pulled By: IslamAbdelRahman fbshipit-source-id: b3ae21f	2016-11-28 18:39:13 -08:00
Islam AbdelRahman	52fd1ff2c2	disable UBSAN for functions with intentional -ve shift / overflow Summary: disable UBSAN for functions with intentional left shift on -ve number / overflow These functions are rocksdb:: Hash FixedLengthColBufEncoder::Append FaultInjectionTest:: Key Closes https://github.com/facebook/rocksdb/pull/1577 Differential Revision: D4240801 Pulled By: IslamAbdelRahman fbshipit-source-id: 3e1caf6	2016-11-28 17:54:12 -08:00
Islam AbdelRahman	1886c435b9	Fix CompactionJob::Install division by zero Summary: Fix CompactionJob::Install division by zero Closes https://github.com/facebook/rocksdb/pull/1580 Differential Revision: D4240794 Pulled By: IslamAbdelRahman fbshipit-source-id: 7286721	2016-11-28 16:54:16 -08:00
Islam AbdelRahman	63c30de80d	fix options_test ubsan Summary: Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number Closes https://github.com/facebook/rocksdb/pull/1579 Differential Revision: D4240798 Pulled By: IslamAbdelRahman fbshipit-source-id: bd6267e	2016-11-28 16:39:14 -08:00
Islam AbdelRahman	13e66a8f51	Fix compaction_job.cc division by zero Summary: Fix division by zero in compaction_job.cc Closes https://github.com/facebook/rocksdb/pull/1575 Differential Revision: D4240818 Pulled By: IslamAbdelRahman fbshipit-source-id: a8bc757	2016-11-28 16:39:13 -08:00
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Karthikeyan Radhakrishnan	3068870cce	Making persistent cache more resilient to filesystem failures Summary: The persistent cache is designed to hop over errors and return key not found. So far, it has shown resilience to write errors, encoding errors, data corruption etc. It is not resilient against disappearing files/directories. This was exposed during testing when multiple instances of persistence cache was started sharing the same directory simulating an unpredictable filesystem environment. This patch - makes the write code path more resilient to errors while creating files - makes the read code path more resilient to handle situation where files are not found - added a test that does negative write/read testing by removing the directory while writes are in progress Closes https://github.com/facebook/rocksdb/pull/1472 Differential Revision: D4143413 Pulled By: kradhakrishnan fbshipit-source-id: fd25e9b	2016-11-22 10:39:10 -08:00
Andrew Kryczka	734e4acafb	Eliminate redundant cache lookup with range deletion Summary: When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader. Closes https://github.com/facebook/rocksdb/pull/1537 Differential Revision: D4201167 Pulled By: ajkr fbshipit-source-id: d33ffd8	2016-11-21 21:24:11 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Siying Dong	f2a8f92a15	rocks_lua_compaction_filter: add unused attribute to a variable Summary: Release build shows warning without this fix. Closes https://github.com/facebook/rocksdb/pull/1558 Differential Revision: D4215831 Pulled By: yiwu-arbug fbshipit-source-id: 888a755	2016-11-21 14:54:14 -08:00
Nick Terrell	4444256ab7	Remove use of deprecated LZ4 function Summary: LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`. Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two functions calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408). Closes https://github.com/facebook/rocksdb/pull/1532 Differential Revision: D4199240 Pulled By: siying fbshipit-source-id: 138c2bc	2016-11-21 12:24:14 -08:00
Changli Gao	548d7fb261	Fix fd leak when using direct IOs Summary: We should close the fd, before overriding it. This bug was introduced by `f89caa127b` Closes https://github.com/facebook/rocksdb/pull/1553 Differential Revision: D4214101 Pulled By: siying fbshipit-source-id: 0d65de0	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Joel Marcey	23a18ca5ad	Reword support a little bit to more clear and concise Summary: I tried to do this in #1556, but it landed before the change could be imported. Closes https://github.com/facebook/rocksdb/pull/1557 Differential Revision: D4214572 Pulled By: siying fbshipit-source-id: 718d4a4	2016-11-21 11:39:13 -08:00
Joel Marcey	481856ac46	Update support to separate code issues with general questions Summary: Closes https://github.com/facebook/rocksdb/pull/1556 Differential Revision: D4214184 Pulled By: siying fbshipit-source-id: c1abf47	2016-11-21 10:54:12 -08:00
Changli Gao	a0deec960f	Fix deadlock when calling getMergedHistogram Summary: When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if there is a dying thread, which is calling ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to HistogramInfo, deadlock will occur. Because the former try to hold merge_lock then ThreadMeta::mutex_, but the later try to hold ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't the same. This patch addressed this issue by releasing merge_lock before folding thread values. Closes https://github.com/facebook/rocksdb/pull/1552 Differential Revision: D4211942 Pulled By: ajkr fbshipit-source-id: ef89bcb	2016-11-20 18:24:12 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00

... 3 4 5 6 7 ...

5873 Commits