rocksdb

Author	SHA1	Message	Date
Maysam Yabandeh	3c3252a06a	Fix tsan complaint in ConcurrentMergeWrite test (#5308 ) Summary: The test was not using separate MemTablePostProcessInfo per memtable insert thread and thus tsan was complaining about data race. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5308 Differential Revision: D15356420 Pulled By: maysamyabandeh fbshipit-source-id: 46c2f2d19fb02c3c775b587aa09ca9c0dae6ed04	2019-05-15 11:21:48 -07:00
anand76	6492430eaf	Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301 ) Summary: This PR has two fixes for crash test failures - 1. Fix a bug in TestMultiGet() in db_stress that was passing list of key to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values 2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time Test - asan_crash and ubsan_crash tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301 Differential Revision: D15337383 Pulled By: anand1976 fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6	2019-05-14 11:58:04 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying. Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITE_PREPARED with unordered_write disable concurrency control: 185.3 MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yi Wu	92c60547fe	db_bench: fix hang on IO error (#5300 ) Summary: db_bench will wait indefinitely if there's a background error. Fix by passing `abs_time_us` to the cond var. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300 Differential Revision: D15319945 Pulled By: miasantreble fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44	2019-05-13 11:30:35 -07:00
Yanqin Jin	e626016545	Fix a race condition caused by unlocking db mutex (#5294 ) Summary: Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows. Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong. The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used or the memtable list does not have unpicked memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294 Differential Revision: D15295297 Pulled By: riversand963 fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd	2019-05-10 17:56:48 -07:00
Mike Kolupaev	6a6aef25c1	Fix crash in BlockBasedTableIterator::Seek() (#5291 ) Summary: https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true. E.g. if there was an IO error or Status::Incomplete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291 Differential Revision: D15273324 Pulled By: al13n321 fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886	2019-05-10 12:40:57 -07:00
Levi Tamasi	f0bf3bf34b	Turn CachableEntry into a proper resource handle (#5252 ) Summary: CachableEntry is used in a variety of contexts: it may refer to a cached object (i.e. an object in the block cache), an owned object, or an unowned object; also, in some cases (most notably with iterators), the responsibility of managing the pointed-to object gets handed off to another object. Each of the above scenarios has different implications for the lifecycle of the referenced object. For the most part, the patch does not change the lifecycle of managed objects; however, it makes these relationships explicit, and it also enables us to eliminate some hacks and accident-prone code around releasing cache handles and deleting/cleaning up objects. (The only places where the patch changes how objects are managed are the partitions of partitioned indexes and filters.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252 Differential Revision: D15101358 Pulled By: ltamasi fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be	2019-05-10 11:57:49 -07:00
Jelte Fennema	6451673f37	Add C bindings for LowerThreadPoolIO/CPUPriority (#5285 ) Summary: There were no C bindings for lowering thread pool priority. This adds those. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285 Differential Revision: D15290050 Pulled By: siying fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a	2019-05-09 18:21:21 -07:00
Siying Dong	9fad3e21eb	Merging iterator to avoid child iterator reseek for some cases (#5286 ) Summary: When reseek happens in merging iterator, reseeking a child iterator can be avoided if: (1) the iterator represents immutable data (2) reseek() to a larger key than the current key (3) the current key of the child iterator is larger than the seek key because it is guaranteed that the result will fall into the same position. This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286 Differential Revision: D15283635 Pulled By: siying fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83	2019-05-09 14:20:04 -07:00
anand76	181bb43f08	Fix bugs in FilePickerMultiGet (#5292 ) Summary: This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by - 1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again 2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but its good to handle that case correctly Test - asan_crash make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292 Differential Revision: D15282530 Pulled By: anand1976 fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f	2019-05-09 13:18:00 -07:00
Siying Dong	25d81e4577	DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244 ) Summary: Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number. We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed. In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244 Differential Revision: D15067257 Pulled By: siying fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9	2019-05-09 12:24:04 -07:00
Zhongyi Xie	bdba6c56dd	add WAL replay in TryCatchUpWithPrimary (#5282 ) Summary: Previously in PR https://github.com/facebook/rocksdb/pull/5161 we added the capability to do WAL tailing in `OpenAsSecondary`. In this PR we extend that feature to `TryCatchUpWithPrimary`, which is useful for a secondary RocksDB instance to retrieve and apply the latest updates and refresh log readers if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5282 Differential Revision: D15261011 Pulled By: miasantreble fbshipit-source-id: a15c94471e8c3b3b1f7f47c3135db1126e936949	2019-05-08 10:59:37 -07:00
Zhongyi Xie	eea1cad850	avoid updating index type during iterator creation (#5288 ) Summary: Right now there is a potential race condition where two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads failed to find index_reader from block cache, they will call `CreateIndexReader->UpdateIndexType()` which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realize the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288 Differential Revision: D15252509 Pulled By: miasantreble fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e	2019-05-07 20:20:40 -07:00
anand76	930bfa5750	Disable MultiGet from db_stress (#5284 ) Summary: Disable it for now until we can get stress tests to pass consistently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5284 Differential Revision: D15230727 Pulled By: anand1976 fbshipit-source-id: 239baacdb3c4cd4fb7c4447f7582b9042501d752	2019-05-06 18:26:50 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. For simplicity, the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Zhongyi Xie	5d27d65bef	multiget: fix memory issues due to vector auto resizing (#5279 ) Summary: This PR fixes three memory issues found by ASAN * in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidate references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance. * Similar issue in construction of GetContext autovector in version_set.cc * In multiget_context.h use T[] specialization for unique_ptr that holds a char array Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279 Differential Revision: D15202893 Pulled By: miasantreble fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d	2019-05-03 15:58:43 -07:00
Zhongyi Xie	3e994809a1	fix implicit conversion error reported by clang check (#5277 ) Summary: fix the following clang check errors ``` tools/db_stress.cc:3609:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ tools/db_stress.cc:3888:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ 2 errors generated. make: *** [tools/db_stress.o] Error 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5277 Differential Revision: D15196620 Pulled By: miasantreble fbshipit-source-id: d56b1420d4a9f1df875fc52877a5fbb342bc7cae	2019-05-03 10:02:27 -07:00
Adam Retter	5882e847aa	Allow builds of RocksJava debug releases (#5274 ) Summary: This allows debug releases of RocksJava to be built with the Docker release targets. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5274 Differential Revision: D15185067 Pulled By: sagar0 fbshipit-source-id: f3988e472f281f5844d9a07098344a827b1e7eb1	2019-05-02 14:27:20 -07:00
anand76	434ccf2df4	Add option to use MultiGet in db_stress (#5264 ) Summary: The new option will pick a batch size randomly in the range 1-64. It will then space the keys in the batch by random intervals. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5264 Differential Revision: D15175522 Pulled By: anand1976 fbshipit-source-id: c16baa69d0f1ff4cf53c55c813ddd82c8aeb58fc	2019-05-01 23:06:56 -07:00
Zhongyi Xie	d51eb0b583	set snappy compression only when supported (#4325 ) Summary: Right now `OptimizeLevelStyleCompaction` may set compression type to Snappy even when Snappy is not supported, this may cause errors like "no snappy compression support" Fixes https://github.com/facebook/rocksdb/issues/4283 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4325 Differential Revision: D15125542 Pulled By: miasantreble fbshipit-source-id: 70890b73ababe16752721555dbd290633c2aafac	2019-05-01 20:40:00 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00
Siying Dong	4e0f2aadb0	DB::Close() to fail when there are unreleased snapshots (#5272 ) Summary: Sometimes, users might make the mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We make DB::Close() fail in this case to give users a way to check for it. Aborted() will be returned to users when they call DB::Close(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272 Differential Revision: D15159713 Pulled By: siying fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd	2019-05-01 10:17:30 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Fosco Marotto	36ea379cdc	Update history and version for future 6.2.0 (#5270 ) Summary: Update history before branch cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5270 Differential Revision: D15153700 Pulled By: gfosco fbshipit-source-id: 2c81e01a2ab965661b1d88209dca74ba0a3756cb	2019-04-30 15:09:36 -07:00
Yuqi Gu	03c7ae24c2	RocksDB CRC32c optimization with ARMv8 Intrinsic (#5221 ) Summary: 1. Add Arm linear crc32c implementation for RocksDB. 2. Arm runtime check for crc32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5221 Differential Revision: D15013685 Pulled By: siying fbshipit-source-id: 2c2983743d26656d93f212dc7c1a3cf66a1acf12	2019-04-30 10:59:05 -07:00
David Palm	a5debd7ed8	Add rocksdb_property_int_cf (#5268 ) Summary: Adds the missing `rocksdb_property_int_cf` function to the C API to let consuming libraries avoid parsing strings. Fixes https://github.com/facebook/rocksdb/issues/5249 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5268 Differential Revision: D15149461 Pulled By: maysamyabandeh fbshipit-source-id: e9fe5f1ad7c64066d921dba8473507269b51d331	2019-04-30 10:13:28 -07:00
Andrew Kryczka	b02d0c238d	Init compression dict handle before reading meta-blocks (#5267 ) Summary: At least one of the meta-block loading functions (`ReadRangeDelBlock`) uses the same block reading function (`NewDataBlockIterator`) as data block reads, which means it uses the dictionary handle. However, the dictionary handle was uninitialized while reading meta-blocks, causing readers to receive an error. This situation was only noticed when `cache_index_and_filter_blocks=true`. This PR initializes the handle to null while reading meta-blocks to prevent the error. It also adds support to `db_stress` / `db_crashtest.py` for `cache_index_and_filter_blocks`. Fixes #5263. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267 Differential Revision: D15149264 Pulled By: maysamyabandeh fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b	2019-04-30 09:50:49 -07:00
bxq2011hust	25810ca9c7	compile gtest only when enable test Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5248 Differential Revision: D15149190 Pulled By: maysamyabandeh fbshipit-source-id: fd6d799e80bb502a7ddbc07032ea87e2e3f1e24f	2019-04-30 09:33:44 -07:00
Yanqin Jin	210b49cac9	Disable pipelined write in atomic flush stress test (#5266 ) Summary: Since currently pipelined write allows one thread to perform memtable writes while another thread is traversing the `flush_scheduler_`, it will cause an assertion failure in `FlushScheduler::Clear`. To unblock crash recovery tests, we temporarily disable pipelined write when atomic flush is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5266 Differential Revision: D15142285 Pulled By: riversand963 fbshipit-source-id: a0c20fe4ac543e08feaed602414f982054df7831	2019-04-30 08:12:42 -07:00
Tongliang Liao	18864567c8	CMake has stock FindZLIB in upper case. (#5261 ) Summary: More details in https://cmake.org/cmake/help/v3.14/module/FindZLIB.html This resolves the cmake config error of not finding `Findzlib` on Linux (CentOS 7 + cmake 3.14.3 + gcc-8). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5261 Differential Revision: D15138052 Pulled By: maysamyabandeh fbshipit-source-id: 2f4445f49a36c16e6f1e05c090018c02379c0de4	2019-04-29 15:30:29 -07:00
Yanqin Jin	35e6ba734e	Fix a bug when trigger atomic flush and close db (#5254 ) Summary: With atomic flush, RocksDB background flush will flush memtables of a column family up to the largest memtable id in the immutable memtable list. This can introduce a bug in the following scenario. A user thread inserts into a column family until the memtable is full and triggers a flush. This will add the column family to flush_scheduler_. Then the user thread writes another record to the column family. In the PreprocessWrite function, the user thread picks the column family from flush_scheduler_ and schedules a flush request. The flush request guarantees to flush all the memtables up to the current largest memtable ID of the immutable memtable list. Then the user thread writes new data to the newly-created active memtable. After the write returns, the user thread closes the db. This can cause assertion failure when the background flush thread tries to install superversion for the column family. The solution is to not install flush results if the db has already set `shutting_down_` to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5254 Differential Revision: D15124149 Pulled By: riversand963 fbshipit-source-id: 0a667a41339dedb5a18bcb01b0bf11c275c04df0	2019-04-29 12:48:32 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Maysam Yabandeh	8c7eb59838	Fix ubsan failure in snapshot refresh (#5257 ) Summary: The newly added test CompactionJobTest.SnapshotRefresh sets the snapshot refresh period to 0 to stress the feature. This results into large number of refresh events, which in turn results into an UBSAN failure when a bitwise shift operand goes beyond the uint64_t size. The patch fixes that by simplifying the shift logic to be done only by 2 bits after each refresh. Furthermore it verifies that the shift operation does not result in decreasing the refresh period. Testing: COMPILE_WITH_UBSAN=1 make -j32 compaction_job_test ./compaction_job_test --gtest_filter=CompactionJobTest.SnapshotRefresh Pull Request resolved: https://github.com/facebook/rocksdb/pull/5257 Differential Revision: D15106463 Pulled By: maysamyabandeh fbshipit-source-id: f2718898ea7ba4fa9f7e87b70cf98fe647c0de80	2019-04-26 17:30:30 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
Andrew Kryczka	6eb317bb4c	Option string/map/file can set env from object registry (#5237 ) Summary: - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object. - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup. - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237 Differential Revision: D15056360 Pulled By: siying fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e	2019-04-25 11:35:09 -07:00
niukuo	084a3c697c	add missing rocksdb_flush_cf in c (#5243 ) Summary: same to #5229 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5243 Differential Revision: D15082800 Pulled By: siying fbshipit-source-id: f4a68a480db0e40e1ba7cf37e18b88e43dff7c08	2019-04-25 11:25:43 -07:00
Yanqin Jin	da96f2fe00	Close WAL files before deletion (#5233 ) Summary: Currently one thread in RocksDB keeps a WAL file open while another thread deletes it. Although the first thread never writes to the WAL again, it still tries to close it in the end. This is fine on POSIX, but can be problematic on other platforms, e.g. HDFS, etc.. It will either cause a lot of warning messages or throw exceptions. The solution is to let the second thread close the WAL before deleting it. RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233 Differential Revision: D15032670 Pulled By: riversand963 fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754	2019-04-25 10:11:41 -07:00
Zhongyi Xie	66d8360beb	update history.md (#5245 ) Summary: update history.md for `BottommostLevelCompaction::kForceOptimized` to mention possible user impact. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5245 Differential Revision: D15073712 Pulled By: miasantreble fbshipit-source-id: d40f698c42e8a6368be4eac0a00d02279615edea	2019-04-24 21:30:00 -07:00
Mike Kolupaev	cd77d3c558	Don't call FindObsoleteFiles() in ~ColumnFamilyHandleImpl() if CF is not dropped (#5238 ) Summary: We have a DB with ~4k column families and ~70k files. On shutdown, destroying the 4k ColumnFamilyHandle-s takes over 2 minutes. Most of this time is spent in VersionSet::AddLiveFiles() called from FindObsoleteFiles() from ~ColumnFamilyHandleImpl(). It's just iterating over the list of files in memory. This seems completely unnecessary as no obsolete files are actually found since the CFs are not even dropped. This PR fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5238 Differential Revision: D15056342 Pulled By: siying fbshipit-source-id: 2aa342ef3770b4aa384ce81f8768e485480e4f08	2019-04-24 17:11:36 -07:00
Zhongyi Xie	aa56b7e74a	secondary instance: add support for WAL tailing on `OpenAsSecondary` Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds the support for WAL tailing in `OpenAsSecondary`, which means after the `OpenAsSecondary` call, the secondary is now able to see primary's writes that are yet to be flushed. The secondary can see primary's writes in the WAL up to the moment of `OpenAsSecondary` call starts. Differential Revision: D15059905 Pulled By: miasantreble fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10	2019-04-24 12:08:44 -07:00
anand76	1c8cbf315f	Extend MultiGet batching to Transactions (#5210 ) Summary: MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210 Differential Revision: D15048164 Pulled By: anand1976 fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52	2019-04-23 14:11:26 -07:00
qinzuoyan	a7d103198e	Print smallest and largest seqno in Version::DebugString() for more details (#5231 ) Summary: In some cases, we want to know the smallest and largest sequence numbers of sstable files, to help us get more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5231 Differential Revision: D15038087 Pulled By: siying fbshipit-source-id: c473c1ca07b53efe2f1884fa1ecdc8686f455ed8	2019-04-23 11:22:02 -07:00
Adam Retter	990b2f4cb3	Fix compilation on db_bench_tool.cc on Windows (#5227 ) Summary: I needed this change to be able to build the v6.0.1 release on Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5227 Differential Revision: D15033815 Pulled By: sagar0 fbshipit-source-id: 579f3b8e694c34c0d43527eb2fa37175e37f5911	2019-04-23 11:16:51 -07:00
Siying Dong	72c8533f2c	DBIter to use IteratorWrapper for inner iterator (#5214 ) Summary: It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). Performance number is hard to measure because it is small and there is variation. I ran readseq many times, and there seems to be 1% gain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214 Differential Revision: D15003635 Pulled By: siying fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a	2019-04-23 10:55:01 -07:00
Yuchi Chen	78a6e07c83	Fix compilation errors for 32bits/LITE/ios build. (#5220 ) Summary: When I build RocksDB for a 32-bit/LITE/iOS environment, I get errors like the following: ` table/block_based_table_reader.cc:971:44: error: implicit conversion loses integer precision: 'uint64_t' (aka 'unsigned long long') to 'size_t' (aka 'unsigned long') [-Werror,-Wshorten-64-to-32] size_t block_size = props_block_handle.size(); ~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~^~~~~~ ./util/file_reader_writer.h:177:8: error: private field 'env_' is not used [-Werror,-Wunused-private-field] Env* env_; ^ ` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5220 Differential Revision: D15023481 Pulled By: siying fbshipit-source-id: 1b5d121d3016f2b0a8a9a2cc1bd638479357f9f7	2019-04-22 16:02:16 -07:00
Sagar Vemuri	47fd574829	Log file_creation_time table property (#5232 ) Summary: Log file_creation_time table property when a new table file is created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5232 Differential Revision: D15033069 Pulled By: sagar0 fbshipit-source-id: aaac56a4c03a8f96c338cad1b0cdb7fbfb887647	2019-04-22 15:30:07 -07:00
Andrew Kryczka	8272a6de57	Optionally wait on bytes_per_sync to smooth I/O (#5183 ) Summary: The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyway. My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes. Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it. There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically). The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183 Differential Revision: D14953553 Pulled By: siying fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9	2019-04-22 11:51:39 -07:00
Mike Kolupaev	df38c1ce66	Add BlockBasedTableOptions::index_shortening (#5174 ) Summary: Introduce BlockBasedTableOptions::index_shortening to give users control over which key-shortening techniques are used in building index blocks. Before this patch, both separators and successor keys were shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have a negative impact on index block size. However, it should prevent many unnecessary block loads where, due to the approximation introduced by a shortened successor, a seek would land us in the previous block and then fix it by moving to the next one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174 Differential Revision: D14884185 Pulled By: al13n321 fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534	2019-04-22 08:20:35 -07:00
jsteemann	de76909464	refactor SavePoints (#5192 ) Summary: Savepoints are assumed to be used in a stack-wise fashion (only the top element should be used), so they were stored by `WriteBatch` in a member variable `save_points_` using an std::stack. Conceptually this is fine, but the implementation had a few issues: - the `save_points_` instance variable was a plain pointer to a heap-allocated `SavePoints` struct. The destructor of `WriteBatch` simply deletes this pointer. However, the copy constructor of WriteBatch just copied that pointer, meaning that copying a WriteBatch with active savepoints would very likely have crashed before. Now a proper copy of the savepoints is made in the copy constructor, and not just a copy of the pointer - `save_points_` was an std::stack, which defaults to `std::deque` for the underlying container. A deque is a bit over the top here, as we only need access to the most recent savepoint (i.e. stack.top()) but never any elements at the front. std::deque is rather expensive to initialize in common environments. For example, the STL implementation shipped with GNU g++ will perform a heap allocation of more than 500 bytes to create an empty deque object. Although the `save_points_` container is created lazily by RocksDB, moving from a deque to a plain `std::vector` is much more memory-efficient. So `save_points_` is now a vector. - `save_points_` was changed from a plain pointer to an `std::unique_ptr`, making ownership more explicit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192 Differential Revision: D15024074 Pulled By: maysamyabandeh fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7	2019-04-19 20:33:04 -07:00
Sagar Vemuri	dc64c2f5cc	Fix history to not include some features in 6.1 (#5224 ) Summary: Fix HISTORY.md by removing a few items from 6.1.1 history as they did not make it into the 6.1.fb branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5224 Differential Revision: D15017030 Pulled By: sagar0 fbshipit-source-id: 090724d326d29168952e06dc1a5090c03fdd739e	2019-04-19 13:00:53 -07:00
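
Note on the SavePoints refactor (#5192) listed above: the ownership problem it describes can be illustrated with a minimal sketch. `ToyBatch`, `SavePoint`, and `CopyKeepsSavePoints` are hypothetical names invented for this illustration and are not the real `rocksdb::WriteBatch` code. The sketch only assumes what the commit message states: savepoints are created lazily, held through a `std::unique_ptr`, stored in a `std::vector` used as a stack, and deep-copied by the copy constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical savepoint record: the batch length when the savepoint was set.
struct SavePoint {
  std::size_t size;
};

// Toy stand-in for a write batch; NOT the real rocksdb::WriteBatch.
class ToyBatch {
 public:
  ToyBatch() = default;

  // Deep-copy the savepoints so the two batches never share one allocation.
  // Copying only a raw pointer here would lead to a double delete.
  ToyBatch(const ToyBatch& other) : rep_(other.rep_) {
    if (other.save_points_ != nullptr) {
      save_points_ =
          std::make_unique<std::vector<SavePoint>>(*other.save_points_);
    }
  }

  // Copy-and-swap keeps assignment consistent with the copy constructor.
  ToyBatch& operator=(const ToyBatch& other) {
    if (this != &other) {
      ToyBatch tmp(other);
      rep_.swap(tmp.rep_);
      save_points_.swap(tmp.save_points_);
    }
    return *this;
  }

  void Append(const std::string& data) { rep_ += data; }

  // The container is created lazily, on the first savepoint.
  void SetSavePoint() {
    if (save_points_ == nullptr) {
      save_points_ = std::make_unique<std::vector<SavePoint>>();
    }
    save_points_->push_back(SavePoint{rep_.size()});
  }

  // Only the most recent savepoint is ever used, so a vector's back()
  // and pop_back() are enough. Returns false if no savepoint is active.
  bool RollbackToSavePoint() {
    if (save_points_ == nullptr || save_points_->empty()) return false;
    assert(save_points_->back().size <= rep_.size());
    rep_.resize(save_points_->back().size);
    save_points_->pop_back();
    return true;
  }

  const std::string& Data() const { return rep_; }

 private:
  std::string rep_;
  std::unique_ptr<std::vector<SavePoint>> save_points_;
};

// Copies a batch with an active savepoint, then rolls both back independently.
inline bool CopyKeepsSavePoints() {
  ToyBatch a;
  a.Append("k1");
  a.SetSavePoint();
  a.Append("k2");
  ToyBatch b(a);
  bool ok = a.RollbackToSavePoint() && b.RollbackToSavePoint();
  return ok && a.Data() == "k1" && b.Data() == "k1";
}
```

In this sketch, rolling back one batch leaves the copy's savepoints untouched, which is the property a shared raw pointer cannot provide.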

7971 Commits