rocksdb

Author	SHA1	Message	Date
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Karthikeyan Radhakrishnan	3068870cce	Making persistent cache more resilient to filesystem failures Summary: The persistent cache is designed to hop over errors and return key not found. So far, it has shown resilience to write errors, encoding errors, data corruption etc. It is not resilient against disappearing files/directories. This was exposed during testing when multiple instances of persistence cache was started sharing the same directory simulating an unpredictable filesystem environment. This patch - makes the write code path more resilient to errors while creating files - makes the read code path more resilient to handle situation where files are not found - added a test that does negative write/read testing by removing the directory while writes are in progress Closes https://github.com/facebook/rocksdb/pull/1472 Differential Revision: D4143413 Pulled By: kradhakrishnan fbshipit-source-id: fd25e9b	2016-11-22 10:39:10 -08:00
Andrew Kryczka	734e4acafb	Eliminate redundant cache lookup with range deletion Summary: When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader. Closes https://github.com/facebook/rocksdb/pull/1537 Differential Revision: D4201167 Pulled By: ajkr fbshipit-source-id: d33ffd8	2016-11-21 21:24:11 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Siying Dong	f2a8f92a15	rocks_lua_compaction_filter: add unused attribute to a variable Summary: Release build shows warning without this fix. Closes https://github.com/facebook/rocksdb/pull/1558 Differential Revision: D4215831 Pulled By: yiwu-arbug fbshipit-source-id: 888a755	2016-11-21 14:54:14 -08:00
Nick Terrell	4444256ab7	Remove use of deprecated LZ4 function Summary: LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`. Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two functions calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408). Closes https://github.com/facebook/rocksdb/pull/1532 Differential Revision: D4199240 Pulled By: siying fbshipit-source-id: 138c2bc	2016-11-21 12:24:14 -08:00
Changli Gao	548d7fb261	Fix fd leak when using direct IOs Summary: We should close the fd, before overriding it. This bug was introduced by `f89caa127b` Closes https://github.com/facebook/rocksdb/pull/1553 Differential Revision: D4214101 Pulled By: siying fbshipit-source-id: 0d65de0	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Joel Marcey	23a18ca5ad	Reword support a little bit to more clear and concise Summary: I tried to do this in #1556, but it landed before the change could be imported. Closes https://github.com/facebook/rocksdb/pull/1557 Differential Revision: D4214572 Pulled By: siying fbshipit-source-id: 718d4a4	2016-11-21 11:39:13 -08:00
Joel Marcey	481856ac46	Update support to separate code issues with general questions Summary: Closes https://github.com/facebook/rocksdb/pull/1556 Differential Revision: D4214184 Pulled By: siying fbshipit-source-id: c1abf47	2016-11-21 10:54:12 -08:00
Changli Gao	a0deec960f	Fix deadlock when calling getMergedHistogram Summary: When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if there is a dying thread, which is calling ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to HistogramInfo, deadlock will occur. Because the former try to hold merge_lock then ThreadMeta::mutex_, but the later try to hold ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't the same. This patch addressed this issue by releasing merge_lock before folding thread values. Closes https://github.com/facebook/rocksdb/pull/1552 Differential Revision: D4211942 Pulled By: ajkr fbshipit-source-id: ef89bcb	2016-11-20 18:24:12 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Manuel Ung	e63350e726	Use more efficient hash map for deadlock detection Summary: Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead. Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks. Closes https://github.com/facebook/rocksdb/pull/1545 Differential Revision: D4205662 Pulled By: lth fbshipit-source-id: ff990e4	2016-11-19 11:39:15 -08:00
Siying Dong	a13bde39ee	Skip ldb test in Travis Summary: Travis now is building for ldb tests. Disable for now to unblock other tests while we are investigating. Closes https://github.com/facebook/rocksdb/pull/1546 Differential Revision: D4209404 Pulled By: siying fbshipit-source-id: 47edd97	2016-11-18 19:24:13 -08:00
Siying Dong	73843aa636	Direct I/O Reads Handle the last sector correctly. Summary: Currently, in the Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not multiplier of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With the commit, if the return value is not multiplier of kSectorSize, and all but the last sector has been read, we simply return. Closes https://github.com/facebook/rocksdb/pull/1550 Differential Revision: D4209609 Pulled By: lightmark fbshipit-source-id: cb0b439	2016-11-18 19:24:13 -08:00
Maysam Yabandeh	9d60151b04	Implement PositionedAppend for PosixWritableFile Summary: This patch clarifies the contract of PositionedAppend with some unit tests and also implements it for PosixWritableFile. (Tasks: 14524071) Closes https://github.com/facebook/rocksdb/pull/1514 Differential Revision: D4204907 Pulled By: maysamyabandeh fbshipit-source-id: 06eabd2	2016-11-18 17:24:13 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Kefu Chai	41e77b8390	cmake: s/STEQUAL/STREQUAL/ Summary: Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/1540 Differential Revision: D4207564 Pulled By: siying fbshipit-source-id: 567415b	2016-11-18 14:54:14 -08:00
Islam AbdelRahman	c1038d2837	Release RocksDB 5.0 Summary: Update HISTORY.md and version.h Closes https://github.com/facebook/rocksdb/pull/1536 Differential Revision: D4202987 Pulled By: IslamAbdelRahman fbshipit-source-id: 94985e3	2016-11-17 18:39:15 -08:00
Andrew Kryczka	635a7bd1ad	refactor TableCache Get/NewIterator for single exit points Summary: these functions were too complicated to change with exit points everywhere, so refactored them. btw, please review urgently, this is a prereq to fix the 5.0 perf regression Closes https://github.com/facebook/rocksdb/pull/1534 Differential Revision: D4198972 Pulled By: ajkr fbshipit-source-id: 04ebfb7	2016-11-17 14:39:13 -08:00
Islam AbdelRahman	f39452e81f	Fix heap use after free ASAN/Valgrind Summary: Dont use c_str() of temp std::string in RocksLuaCompactionFilter::Name() Closes https://github.com/facebook/rocksdb/pull/1535 Differential Revision: D4199094 Pulled By: IslamAbdelRahman fbshipit-source-id: e56ce62	2016-11-17 12:24:12 -08:00
Siying Dong	a4eb7387b2	Allow plain table to store index on file with bloom filter disabled Summary: Currently plain table bloom filter is required if storing metadata on file. Remove the constraint. Closes https://github.com/facebook/rocksdb/pull/1525 Differential Revision: D4190977 Pulled By: siying fbshipit-source-id: be60442	2016-11-17 11:09:13 -08:00
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is ineffiecent; * It doesn't make sense to have it as a ticker count if multiple db instance share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	2016-11-16 22:39:09 -08:00
Islam AbdelRahman	86eb2b9ad9	Fix src.mk	2016-11-16 18:05:19 -08:00
Andrew Kryczka	0765babe15	Remove LATEST_BACKUP file Summary: This has been unused since D42069 but kept around for backward compatibility. I think it is unlikely anyone will use a much older version of RocksDB for restore than they use for backup, so I propose removing it. It is also causing recurring confusion, e.g., https://www.facebook.com/groups/rocksdb.dev/permalink/980454015386446/ Ported from https://reviews.facebook.net/D60735 Closes https://github.com/facebook/rocksdb/pull/1529 Differential Revision: D4194199 Pulled By: ajkr fbshipit-source-id: 82f9bf4	2016-11-16 17:24:15 -08:00
Yueh-Hsuan Chiang	647eafdc21	Introduce Lua Extension: RocksLuaCompactionFilter Summary: This diff includes an implementation of CompactionFilter that allows users to write CompactionFilter in Lua. With this ability, users can dynamically change compaction filter logic without requiring building the rocksdb binary and restarting the database. To compile, WITH_LUA_PATH must be specified to the base directory of lua. Closes https://github.com/facebook/rocksdb/pull/1478 Differential Revision: D4150138 Pulled By: yhchiang fbshipit-source-id: ed84222	2016-11-16 15:39:12 -08:00
Andrew Kryczka	760ef68a69	fix deleterange asan issue Summary: pinned_iters_mgr_ pins iterators allocated with arena_, so we should order the instance variable declarations such that the pinned iterators have their destructors executed before the arena is destroyed. Closes https://github.com/facebook/rocksdb/pull/1528 Differential Revision: D4191984 Pulled By: ajkr fbshipit-source-id: 1386f20	2016-11-16 14:09:07 -08:00
Andrew Kryczka	327085b7b2	fix valgrind Summary: Closes https://github.com/facebook/rocksdb/pull/1526 Differential Revision: D4191257 Pulled By: ajkr fbshipit-source-id: d09dc76	2016-11-16 12:09:11 -08:00
Adam Retter	715591bba0	Ask travis to use JDK 7 Summary: yhchiang This may or may not help Closes https://github.com/facebook/rocksdb/pull/1385 Differential Revision: D4098424 Pulled By: yhchiang fbshipit-source-id: 9f9782e	2016-11-16 10:54:12 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Siying Dong	420bdb42e7	option_change_migration_test: force full compaction when needed Summary: When option_change_migration_test decides to go with a full compaction, we don't force a compaction but allow trivial move. This can cause assert failure if the destination is level 0. Fix it by forcing the full compaction to skip trivial move if the destination level is L0. Closes https://github.com/facebook/rocksdb/pull/1518 Differential Revision: D4183610 Pulled By: siying fbshipit-source-id: dea482b	2016-11-15 22:09:34 -08:00
Yi Wu	1543d5d92e	Report memory usage by memtable insert hints map. Summary: It is hard to measure acutal memory usage by std containers. Even providing a custom allocator will miss count some of the usage. Here we only do a wild guess on its memory usage. Closes https://github.com/facebook/rocksdb/pull/1511 Differential Revision: D4179945 Pulled By: yiwu-arbug fbshipit-source-id: 32ab929	2016-11-15 20:24:13 -08:00
Andrew Kryczka	018bb2ebf5	DeleteRange support for db_bench Summary: Added a few options to configure when to add range tombstones during any benchmark involving writes. Closes https://github.com/facebook/rocksdb/pull/1522 Differential Revision: D4187388 Pulled By: ajkr fbshipit-source-id: 2c8a473	2016-11-15 17:39:47 -08:00
Willem Jan Withagen	dc51bd716b	CMakeLists.txt: FreeBSD has jemalloc as default malloc Summary: This will allow reference to `malloc_stats_print` Closes https://github.com/facebook/rocksdb/pull/1516 Differential Revision: D4187258 Pulled By: siying fbshipit-source-id: 34ae9f9	2016-11-15 17:39:47 -08:00
Andrew Kryczka	48e8baebc0	Decouple data iterator and range deletion iterator in TableCache Summary: Previously we used TableCache::NewIterator() for multiple purposes (data block iterator and range deletion iterator), and returned non-ok status in the data block iterator. In one case where the caller only used the range deletion block iterator (`9e7cf3469b/db/version_set.cc (L965-L973)`), we didn't check/free the data block iterator containing non-ok status, which caused a valgrind error. So, this diff decouples creation of data block and range deletion block iterators, and updates the callers accordingly. Both functions can return non-ok status in an InternalIterator. Since the non-ok status is returned in an iterator that the callers will definitely use, it should be more usable/less error-prone. Closes https://github.com/facebook/rocksdb/pull/1513 Differential Revision: D4181423 Pulled By: ajkr fbshipit-source-id: 835b8f5	2016-11-15 17:24:28 -08:00
daoye.ch	4b0aa3c4c8	Fix failed compaction_filter_example and add it into make all Summary: Simple patch as title Closes https://github.com/facebook/rocksdb/pull/1512 Differential Revision: D4186994 Pulled By: siying fbshipit-source-id: 880f9b8	2016-11-15 17:09:10 -08:00
Andrew Kryczka	53b693f5fe	ldb support for range delete Summary: Add a subcommand to ldb with which we can delete a range of keys. Closes https://github.com/facebook/rocksdb/pull/1521 Differential Revision: D4186338 Pulled By: ajkr fbshipit-source-id: b8e9861	2016-11-15 15:54:20 -08:00
Andrew Kryczka	661e4c9267	DeleteRange unsupported in non-block-based tables Summary: Return an error from DeleteRange() (or Write() if the user is using the low-level WriteBatch API) if an unsupported table type is configured. Closes https://github.com/facebook/rocksdb/pull/1519 Differential Revision: D4185933 Pulled By: ajkr fbshipit-source-id: abcdf84	2016-11-15 15:24:16 -08:00
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	2016-11-15 15:24:16 -08:00
Islam AbdelRahman	eba99c28e4	Fix min_write_buffer_number_to_merge = 0 bug Summary: It's possible that we set min_write_buffer_number_to_merge to 0. This should never happen Closes https://github.com/facebook/rocksdb/pull/1515 Differential Revision: D4183356 Pulled By: yiwu-arbug fbshipit-source-id: c9d39d7	2016-11-15 13:54:08 -08:00
Joel Marcey	2ef92fea51	Remove all instances of relative_url until GitHub pages problem is fixed. I am in email thread with GitHub support about what is happening here.	2016-11-15 07:40:18 -08:00
Artemiy Kolesnikov	91300d01f6	Dynamic max_total_wal_size option Summary: Closes https://github.com/facebook/rocksdb/pull/1509 Differential Revision: D4176426 Pulled By: yiwu-arbug fbshipit-source-id: b57689d	2016-11-14 22:54:17 -08:00
Andrew Kryczka	ec2f64794b	Consider subcompaction boundaries when updating file boundaries for range deletion Summary: Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound consider both subcompaction boundaries and previous/next files within the subcompaction. Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after. Closes https://github.com/facebook/rocksdb/pull/1501 Reviewed By: yhchiang Differential Revision: D4171685 Pulled By: ajkr fbshipit-source-id: ee99db8	2016-11-14 20:24:21 -08:00
Joel Marcey	800e51553e	Fix CSS issues again :( I have an email to GitHub support about this.	2016-11-14 20:11:26 -08:00
Yi Wu	b952c898b6	Parallize persistent_cache_test and transaction_test Summary: Parallize persistent_cache_test and transaction_test Closes https://github.com/facebook/rocksdb/pull/1506 Differential Revision: D4179392 Pulled By: IslamAbdelRahman fbshipit-source-id: 05507a1	2016-11-14 20:09:19 -08:00

1 2 3 4 5 ...

5690 Commits