rocksdb

Author	SHA1	Message	Date
Siying Dong	08b8cea69f	Deleting Blob files also goes through SstFileManager (#4904 ) Summary: Right now, deleting blob files is not rate limited, even if SstFileManager is specified. On the other hand, rate limiting blob deletion is not supported. With this change, Blob file deletion will go through SstFileManager too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904 Differential Revision: D13772545 Pulled By: siying fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84	2019-01-22 17:00:29 -08:00
Andrew Kryczka	16a5ac5b69	Update HISTORY.md with new use of ZSTD_CDict (#4901 ) Summary: Mention feature introduced by #4849 in HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4901 Differential Revision: D13746430 Pulled By: ajkr fbshipit-source-id: f7bdea6f0522ed55428cbc521f8a9f3cd0002d4e	2019-01-19 19:17:50 -08:00
Yanqin Jin	e79df377c5	Use chrono::time_point instead of time_t (#4868 ) Summary: By convention, time_t almost always stores the integral number of seconds since 00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/. We surely want more precision than seconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868 Differential Revision: D13633046 Pulled By: riversand963 fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396	2019-01-16 09:51:05 -08:00
Siying Dong	4e37251b4d	With ldb --try_load_options and wal_dir doesn't exist, ignore it (#4875 ) Summary: LDB is frequently used to examine copied data. wal_dir in the option file is not modified and usually points to the path it was copied from. The user experience will be better if, when ldb sees that the wal_dir pointed to by the option file doesn't exist, it ignores it rather than failing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4875 Differential Revision: D13643173 Pulled By: siying fbshipit-source-id: 2e64d4ea2ec49a6794b9a706b7fc1ba901128bb8	2019-01-11 16:48:32 -08:00
Siying Dong	1fb2e274c5	Remove some components (#4101 ) Summary: Remove some components that we have never heard of anyone using. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4101 Differential Revision: D8825431 Pulled By: siying fbshipit-source-id: 97a12ad3cad4ab12c82741a5ba49669aaa854180	2019-01-10 13:30:09 -08:00
Andrew Kryczka	9e2c804fe6	Fix point lookup on range tombstone sentinel endpoint (#4829 ) Summary: Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this: - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1]. - L1's file2 gets compacted to L2. - User issues `Get()` for b#3. - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1. - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`. The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829 Differential Revision: D13561355 Pulled By: ajkr fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f	2019-01-04 11:24:08 -08:00
DorianZheng	8c79f79208	Fix skip WAL for whole write_group when leader's callback fail (#4838 ) Summary: The original implementation has two problems: 1. `f0dda35d7d/db/db_impl_write.cc (L478)` `f0dda35d7d/db/write_thread.h (L231)` If the callback status of the leader of the write_group fails, then the whole write_group will not write to the WAL, which may cause data loss. 2. `f0dda35d7d/db/write_thread.h (L130)` The annotation says that Writer.status is the status of memtable inserter, but the original implementation uses it for another case, which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838 Differential Revision: D13574070 Pulled By: yiwu-arbug fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b	2019-01-03 12:40:42 -08:00
Andrew Kryczka	ace543a815	fix accounting for range tombstones in TableProperties (#4841 ) Summary: - To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`. - Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841 Differential Revision: D13568424 Pulled By: ajkr fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026	2019-01-02 15:08:53 -08:00
Anand Ananthabhotla	b9d6eccac1	Lock free MultiGet (#4754 ) Summary: Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex. Tests: 1. Unit tests 2. make check 3. db_bench - TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1 readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found) Multireadrandom with batch size 1: multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754 Differential Revision: D13363550 Pulled By: anand1976 fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4	2019-01-02 11:42:54 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the performance gap between options.max_open_files being -1 and a large number. To avoid a significant regression in DB reopen speed if options.max_open_files != -1, limit the files preloaded at DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Andrew Kryczka	e0be1bc4f1	fix DeleteRange memory leak for mmap and block cache (#4810 ) Summary: Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`. Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810 Differential Revision: D13534296 Pulled By: ajkr fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599	2018-12-20 21:59:49 -08:00
Yanqin Jin	4fce44fc8b	Improve flushing multiple column families (#4708 ) Summary: If one column family is dropped, we should simply skip it and continue to flush other active ones. Currently we use Status::ShutdownInProgress to notify caller of column families being dropped. In the future, we should consider using a different Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708 Differential Revision: D13378954 Pulled By: riversand963 fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca	2018-12-13 15:12:40 -08:00
Yanqin Jin	f307479ba6	Enable checkpoint of read-only db (#4681 ) Summary: 1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it should call DBImpl::GetLiveFiles(flush_memtable=false). 2. In DBImpl::Recover, we should also recover the OPTIONS file name and/or number so that an immediate subsequent GetLiveFiles will get the correct OPTIONS name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681 Differential Revision: D13069205 Pulled By: riversand963 fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a	2018-12-07 17:06:02 -08:00
Maysam Yabandeh	b878f93c70	Extend Transaction::GetForUpdate with do_validate (#4680 ) Summary: Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. If true, it indicates that the call is assumed to be after a ::GetForUpdate(do_validate=false). The Java APIs are accordingly updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680 Differential Revision: D13068508 Pulled By: maysamyabandeh fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d	2018-12-06 17:49:00 -08:00
Yanqin Jin	1d679e35fd	Update HISTORY.md (#4753 ) Summary: As titled. Update history to include a recent bug fix in `9be3e6b488`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4753 Differential Revision: D13350286 Pulled By: riversand963 fbshipit-source-id: b6324780dee4cb1757bc2209403a08531c150c08	2018-12-05 16:55:58 -08:00
Fosco Marotto	55479eb572	Update History for fast-forwarded 5.18 branch Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4704 Differential Revision: D13283300 Pulled By: gfosco fbshipit-source-id: cb4fdaa93137e0bba64b781ba7e8fe31b19e5656	2018-11-30 16:25:09 -08:00
Siying Dong	6e938c904f	Make NewBloomFilterPolicy() use full filter by default (#4735 ) Summary: Full block (use_block_based_builder=false) Bloom filter has clear CPU saving benefits but with limitation of using temp memory when building an SST file proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction. Now we change to a default with better performance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735 Differential Revision: D13266674 Pulled By: siying fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602	2018-11-30 13:13:27 -08:00
Yi Wu	512a5e3ef8	Fix BlockBasedTable not always using memory allocator if available (#4678 ) Summary: Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678 Differential Revision: D13054594 Pulled By: yiwu-arbug fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2	2018-11-28 18:01:24 -08:00
Abhishek Madan	e76448185c	Remove DeleteRange experimental comment (#4709 ) Summary: DeleteRange is now ready for production use. Change the header comment to reflect this, and update HISTORY.md with the feature's status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4709 Differential Revision: D13209055 Pulled By: abhimadan fbshipit-source-id: 65423eb1a4927cf593c38254cd87c322f73ae137	2018-11-27 11:11:35 -08:00
Andrew Kryczka	07cf0ee589	Fix ticker stat for number files closed (#4703 ) Summary: We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use... Blame: `63f216ee0a` Closes #4700. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703 Differential Revision: D13146769 Pulled By: ajkr fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008	2018-11-21 18:31:34 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Abhishek Madan	ed5aec5ba3	Fix range tombstone covering short-circuit logic (#4698 ) Summary: Since a range tombstone seen at one level will cover all keys in the range at lower levels, there was a short-circuiting check in Get that reported a key was not found at most one file after the range tombstone was discovered. However, this was incorrect for merge operands, since a deletion might only cover some merge operands, which implies that the key should be found. This PR fixes this logic in the Version portion of Get, and removes the logic from the MemTable portion of Get, since the performance benefit provided there is minimal. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698 Differential Revision: D13142484 Pulled By: abhimadan fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10	2018-11-20 13:29:22 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy. It is normal that external cache libraries couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper of the library implementing both the `Cache` and `MemoryAllocator` interfaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
Soli Como	5945e16dfc	Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498 ) Summary: Currently, `Statistics` records ticks through `recordTick()`, whose second parameter is a `uint64_t`. That means a tick can only increase. To reduce a tick, we have to use a workaround like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`. That's kind of a hack. So, this PR divides `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only. Fixes #3013 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498 Differential Revision: D10395010 Pulled By: sagar0 fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2	2018-11-13 11:46:32 -08:00
Andrew Kryczka	ea9454700a	Backup engine support for direct I/O reads (#4640 ) Summary: Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640 Differential Revision: D13015268 Pulled By: ajkr fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269	2018-11-13 11:17:25 -08:00
Zhongyi Xie	b313019326	use per-level perfcontext for DB::Get calls (#4617 ) Summary: This PR adds two more per-level perf context counters to track * number of keys returned in a Get call, broken down by level * total processing time at each level during a Get call Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617 Differential Revision: D12898024 Pulled By: miasantreble fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4	2018-11-13 10:40:49 -08:00
Fosco Marotto	050d73551b	Update history and version for future 5.18 release. Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4669 Differential Revision: D13031522 Pulled By: gfosco fbshipit-source-id: d09e655a9d5556594f195c5d1b786900932145ce	2018-11-12 14:45:47 -08:00
Yanqin Jin	0a53577416	Move xxhash64 checksum support to 'Unreleased' section (#4627 ) Summary: Move the line `Add xxhash64 checksum support` to `Unreleased` section because it has not been released to 5.17. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4627 Differential Revision: D12944123 Pulled By: riversand963 fbshipit-source-id: 2762857065b9e741c64ff8b6116ca62fe31891d8	2018-11-06 12:06:22 -08:00
Maysam Yabandeh	2b5b7bc795	WritePrepared: Fix bug in searching in non-cached snapshots (#4639 ) Summary: When evicting an entry from the commit_cache, it is verified against the list of old snapshots to see if it overlaps with any. The list of old snapshots is split into two lists: an efficient concurrent cache and a slow vector protected by a lock. The patch fixes a bug that would stop the search in the cache if it finds any and yet would not include the larger snapshots in the slower list. An extra info log entry is also removed. The condition to trigger that, although very rare, is still feasible and should not spam the LOG when that happens. Fixes #4621 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4639 Differential Revision: D12934989 Pulled By: maysamyabandeh fbshipit-source-id: 4e0fe8147ba292b554ae78e94c21c2ef31e03e2d	2018-11-05 23:03:50 -08:00
Andrew Kryczka	fffac43cfb	Add DB property for SST files kept from deletion (#4618 ) Summary: This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618 Differential Revision: D12898179 Pulled By: ajkr fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776	2018-11-05 20:24:40 -08:00
Bo Hou	a29053b648	change history.md with new feature Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4626 Differential Revision: D12911848 Pulled By: jsjhoubo fbshipit-source-id: db6c59665e7cdbda20c6c63b0abd3ce24b473ae9	2018-11-02 19:08:03 -07:00
Abhishek Madan	eaaf1a6f05	Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594 ) Summary: Since the number of range deletions are reported in TableProperties, it is confusing to not report the number of merge operands and point deletions as top-level properties; they are accessible through the public API, but since they are not the "main" properties, they do not appear in aggregated table properties, or the string representation of table properties. This change promotes those two property keys to `rocksdb/table_properties.h`, adds corresponding uint64 members for them, deprecates the old access methods `GetDeletedKeys()` and `GetMergeOperands()` (though they are still usable for now), and removes `InternalKeyPropertiesCollector`. The property key strings are the same as before this change, so this should be able to read DBs written from older versions (though I haven't tested this yet). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594 Differential Revision: D12826893 Pulled By: abhimadan fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68	2018-10-30 15:34:27 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Yi Wu	f560c8f5c8	s/CacheAllocator/MemoryAllocator/g (#4590 ) Summary: Rename the interface, as it is meant to be a generic interface for memory allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590 Differential Revision: D10866340 Pulled By: yiwu-arbug fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16	2018-10-26 14:30:30 -07:00
Maysam Yabandeh	c34cc40424	Fix user comparator receiving internal key (#4575 ) Summary: There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augments an existing unit test to reproduce the bug and fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575 Differential Revision: D10500434 Pulled By: maysamyabandeh fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366	2018-10-23 08:14:46 -07:00
Siying Dong	7024263682	Dynamic level to adjust level multiplier when write is too heavy (#4338 ) Summary: Level compaction usually performs poorly when writes are so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0. We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338 Differential Revision: D9636782 Pulled By: siying fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595	2018-10-22 10:21:47 -07:00
Siying Dong	c17383f918	Fix WriteBatchWithIndex's SeekForPrev() (#4559 ) Summary: WriteBatchWithIndex's SeekForPrev() has a bug that we internally place the position just before the seek key rather than after. This makes the iterator miss the result that is the same as the seek key. Fix it by positioning the iterator equal or smaller. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4559 Differential Revision: D10468534 Pulled By: siying fbshipit-source-id: 2fb371ae809c561b60a1c11cef71e1c66fea1f19	2018-10-19 14:40:50 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: The current implementation of perf context is level agnostic, making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as tuning bloom filter. Also replaced __thread with thread_local keyword for perf_context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
anand1976	1e3845805d	Properly determine a truncated CompactRange stop key (#4496 ) Summary: When a CompactRange() call for a level is truncated before the end key is reached, because it exceeds max_compaction_bytes, we need to properly set the compaction_end parameter to indicate the stop key. The next CompactRange will use that as the begin key. We set it to the smallest key of the next file in the level after expanding inputs to get a clean cut. Previously, we were setting it before expanding inputs. So we could end up recompacting some files. In a pathological case, where a single key has many entries spanning all the files in the level (possibly due to merge operands without a partial merge operator, thus resulting in compaction output identical to the input), this would result in an endless loop over the same set of files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496 Differential Revision: D10395026 Pulled By: anand1976 fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9	2018-10-15 23:22:51 -07:00
Andrew Kryczka	32b4d4ad47	Avoid per-key linear scan over snapshots in compaction (#4495 ) Summary: `CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495 Differential Revision: D10386470 Pulled By: ajkr fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b	2018-10-15 16:21:22 -07:00
Abhishek Madan	9c6fea7fe1	Update HISTORY.md, fix unity_test failure (#4479 ) Summary: Follow-up to https://github.com/facebook/rocksdb/pull/4432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479 Differential Revision: D10304151 Pulled By: abhimadan fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7	2018-10-10 12:09:56 -07:00
Anand Ananthabhotla	854a4be03f	Handle mixed slowdown/no_slowdown writer properly (#4475 ) Summary: There is a bug when the write queue leader is blocked on a write delay/stop, and the queue has writers with WriteOptions::no_slowdown set to true. They are not woken up until the write stall is cleared. The fix introduces a dummy writer inserted at the tail to indicate a write stall and prevent further inserts into the queue, and a condition variable that writers who can tolerate slowdown wait on before adding themselves to the queue. The leader calls WriteThread::BeginWriteStall() to add the dummy writer and then walks the queue to fail any writers with no_slowdown set. Once the stall clears, the leader calls WriteThread::EndWriteStall() to remove the dummy writer and signal the condition variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475 Differential Revision: D10285827 Pulled By: anand1976 fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9	2018-10-09 22:52:40 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: The controller you requested could not be found. PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
Fosco Marotto	b787cf9e42	Update HISTORY.md to current status (#4471 ) Summary: 5.16.x status wasn't tracked, and also updated for pending 5.17 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4471 Differential Revision: D10240925 Pulled By: gfosco fbshipit-source-id: 95ab368a04a65b201d2518097af69edf2402f544	2018-10-08 11:15:09 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Andrew Kryczka	ac6f435a9a	Fix CompactFiles support for kDisableCompressionOption (#4438 ) Summary: Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438 Differential Revision: D10115761 Pulled By: ajkr fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122	2018-10-01 01:18:10 -07:00
Maysam Yabandeh	a0ebec3804	Extend crash test with index_block_restart_interval (#4383 ) Summary: The default for index_block_restart_interval is 1 but some use 16 in production. The patch extends crash test to test both values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4383 Differential Revision: D9887304 Pulled By: maysamyabandeh fbshipit-source-id: a8d00fea974a79ad563f9f4d9d7b069e9f746a8f	2018-09-18 15:43:29 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause a write stall by calling flush (either manually via DB::Flush(), or from Backup Engine directly calling FlushMemTable()) while a background flush may already be happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for a possible stall and delay the flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable. This is a draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Andrew Kryczka	42733637e1	Sync CURRENT file during checkpoint (#4322 ) Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it. Differential Revision: D9525939 Pulled By: ajkr fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3	2018-08-28 12:43:18 -07:00

484 Commits
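
Commit 32b4d4ad47 in the log above replaces a per-key linear scan over snapshots with a binary search, relying on the snapshot list being sorted by ascending sequence number. The following is a minimal sketch of that idea only, not the RocksDB implementation: the name `EarliestVisibleSnapshot`, the `kNoSnapshot` sentinel, and the plain `std::vector` input are assumptions made for illustration. It treats a key as visible to a snapshot when the key's sequence number is less than or equal to the snapshot's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using SequenceNumber = std::uint64_t;

// Sentinel meaning "no snapshot can see this key" (an assumption of this sketch).
constexpr SequenceNumber kNoSnapshot =
    std::numeric_limits<SequenceNumber>::max();

// Hypothetical stand-in for the lookup described in commit 32b4d4ad47.
// `snapshots` must be sorted by ascending sequence number.
// Returns the smallest snapshot seqnum that is >= key_seq, i.e. the earliest
// snapshot in which the key is visible, or kNoSnapshot if there is none.
inline SequenceNumber EarliestVisibleSnapshot(
    const std::vector<SequenceNumber>& snapshots, SequenceNumber key_seq) {
  assert(std::is_sorted(snapshots.begin(), snapshots.end()));
  // std::lower_bound does a binary search: O(log n) per key instead of O(n).
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), key_seq);
  return it == snapshots.end() ? kNoSnapshot : *it;
}
```

Because this lookup runs once per key during flush and compaction, the saving grows with the number of live snapshots, which matches the motivation given in the commit message.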