rocksdb

Author	SHA1	Message	Date
haoyuhuang	74a334a2eb	Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333 ) Summary: RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333 Differential Revision: D15457597 Pulled By: HaoyuHuang fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214	2019-05-23 21:58:52 -07:00
Zhongyi Xie	6a54278b4a	add class level comment for RepeatableThread Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5344 Differential Revision: D15485431 Pulled By: miasantreble fbshipit-source-id: 9c0f6cf0d826743e743012549976705ceb8cc0c4	2019-05-23 17:03:23 -07:00
Zhongyi Xie	09b534cc2f	improve comments for CompactionJob (#5341 ) Summary: add class/function level comments to the header file Pull Request resolved: https://github.com/facebook/rocksdb/pull/5341 Differential Revision: D15485442 Pulled By: miasantreble fbshipit-source-id: 9f11e2a1cd3ce0f4990f01353d0a6f4b050615cf	2019-05-23 16:57:46 -07:00
Siying Dong	38a06aa225	Improve comments of classes for PlainTable (#5339 ) Summary: Simply add some comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5339 Differential Revision: D15485315 Pulled By: siying fbshipit-source-id: 4594b1c4c967e6bd08aa7fa08a37df3481df1938	2019-05-23 16:51:44 -07:00
Siying Dong	02830a20f8	Add comments in db/dbformat.h Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5342 Differential Revision: D15485238 Pulled By: siying fbshipit-source-id: a56b374584cb1d815c1173907a807d90b37d4dd6	2019-05-23 16:44:20 -07:00
Siying Dong	dc30a9b69b	Add comments to db/db_iter.h (#5340 ) Summary: Add file comment in db/db_iter.h and minor changes in other parts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5340 Differential Revision: D15484605 Pulled By: siying fbshipit-source-id: 173771f9d5bd51303de5410ee5afd0a4af9d6572	2019-05-23 16:11:38 -07:00
Levi Tamasi	40aa520a51	Add class comment for BlockFetcher Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5337 Differential Revision: D15482289 Pulled By: ltamasi fbshipit-source-id: 8639ca78c1b8dfcc337a742d4d81d5752f12545f	2019-05-23 14:22:26 -07:00
Silver Chan	2095ae8858	fixed db_stress.cc build error (#5307 ) Summary: when building this file using Xcode 10.2.1 in MacOSX10.14, the compiler report this error: ` rocksdb/tools/db_stress.cc:3613:33: error: implicit instantiation of undefined template 'std::__1::array<std::__1::basic_string<char>, 10>' std::array<std::string, 10> keys = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}; /usr/include/c++/v1/__tuple:223:64: note: template is declared here template <class _Tp, size_t _Size> struct _LIBCPP_TEMPLATE_VIS array; ^ 1 error generated. ` if including array, this error will be fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5307 Differential Revision: D15475217 Pulled By: sagar0 fbshipit-source-id: b04a7658c2ca2573157028863b3a80f5ab52b9de	2019-05-23 14:03:25 -07:00
Thomas Fersch	3d9d77d900	Restrict L0->L0 compaction according to max_compaction_bytes option (#5329 ) Summary: Modified FindIntraL0Compaction to stop picking more files if total amount of compensated bytes would be larger than max_compaction_bytes option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5329 Differential Revision: D15435728 Pulled By: ThomasFersch fbshipit-source-id: d118a6da88d5df8ee20944422ade37cf6b15d60c	2019-05-22 23:40:57 -07:00
haoyuhuang	518cd1a62a	Use GetCurrentManifestPath to locate current MANIFEST file (#5331 ) Summary: In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331 Differential Revision: D15444524 Pulled By: HaoyuHuang fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903	2019-05-22 09:21:56 -07:00
Sagar Vemuri	dda474399a	Remove PATENTS text from a few straggler files (#5326 ) Summary: Remove PATENTS related wording from a few stragglers which still reference the old PATENTS file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5326 Differential Revision: D15423297 Pulled By: sagar0 fbshipit-source-id: 4babcddfc120b7d2fed6eb3898287cf8012bf8ea	2019-05-21 16:22:35 -07:00
Siying Dong	b2274da0e5	LogWriter to only flush after finish generating whole record (#5328 ) Summary: Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328 Differential Revision: D15425032 Pulled By: siying fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa	2019-05-21 12:33:17 -07:00
Siying Dong	cd43446d01	Improve DBTablePropertiesTest.GetPropertiesOfTablesInRange (#5302 ) Summary: DBTablePropertiesTest.GetPropertiesOfTablesInRange sometimes hits the assert that generated LSM-tree doesn't have L1 file. Tighten the compaction triggering condition even further, hoping it goes away. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5302 Differential Revision: D15325971 Pulled By: siying fbshipit-source-id: 3e032bdb16fe8d98d5fcfcd65dd8be9781f3d6ae	2019-05-20 13:50:53 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Yanqin Jin	fb4c6a31ce	Log replay integration for secondary instance (#5305 ) Summary: RocksDB secondary can replay both MANIFEST and WAL now. On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data. This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305 Differential Revision: D15386512 Pulled By: riversand963 fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355	2019-05-17 19:19:51 -07:00
yiwu-arbug	f3a7847598	Reduce iterator key comparison for upper/lower bound check (#5111 ) Summary: Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111 Differential Revision: D15330061 Pulled By: siying fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024	2019-05-17 10:28:31 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
Maysam Yabandeh	c71f5bb9aa	Disable WriteUnPrepared stress tests (#5315 ) Summary: They are kind of flaky at the moment. Will re-enable it when flakiness is fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5315 Differential Revision: D15382744 Pulled By: maysamyabandeh fbshipit-source-id: 8b2f9d81a4bb34bfd51481727a682d5cd063c5e3	2019-05-16 15:39:33 -07:00
Siying Dong	f82e693a31	RangeDelAggregator::StripeRep::Invalidate() to be skipped if empty (#5312 ) Summary: RangeDelAggregator::StripeRep::Invalidate() clears up several vectors. If we know there isn't anything to there, we can safe these small CPUs. Profiling shows that it sometimes take non-negligible amount of CPU. Worth a small optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5312 Differential Revision: D15380511 Pulled By: siying fbshipit-source-id: 53c5f34c33b4cb1e743643c6086ac56d0b84ec2e	2019-05-16 15:24:28 -07:00
Azat Khuzhin	29a198564d	Fixes for build_detect_platform Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5255 Differential Revision: D15246532 Pulled By: riversand963 fbshipit-source-id: 96a21509666152788fa2f956e865a6bed7c8f474	2019-05-15 16:01:08 -07:00
Yanqin Jin	1583cb402e	Fix a flaky test with test sync point (#5310 ) Summary: If DB is opened with `avoid_unnecessary_blocking_io` being true, then `~ColumnFamilyHandleImpl` enqueues a purge request and schedules a background thread to perform the deletion. Without test sync point, whether the SST file is purged or not at a later point in time is not deterministic. If the SST does not exist, it will cause an assertion failure. How to reproduce: ``` $git checkout `6492430eaf` $make -j20 deletefile_test $gtest-parallel --repeat 1000 --worker 16 ./deletefile_test --gtest_filter=DeleteFileTest.BackgroundPurgeCFDropTest ``` The test may fail a few times. With changes made in this PR, repeat the above commands, and the test should not fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5310 Differential Revision: D15361136 Pulled By: riversand963 fbshipit-source-id: c4308d5f8da83472c893bf7f8ceed347fbfa850f	2019-05-15 15:17:55 -07:00
Dave Rigby	8149bb9d6a	Pass OptionTypeInfo maps by const& (#5295 ) Summary: In options_helper.cc various functions take a const unordered_map of string -> TypeInfo for options handling. These functions pass by-value the (const) maps, resulting in unnecessary copies. Change to pass by reference. This results in a noticable reduction in the amount of time spent parsing options - in my case a set of unit tests using RocksDB which call SetOptions() to modify options see a ~25% runtime reduction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5295 Differential Revision: D15296334 Pulled By: riversand963 fbshipit-source-id: 4d4be3db635264943607911b296dda27fd7ce1a7	2019-05-15 14:25:57 -07:00
Raphael Bost	468ca61105	Break large file writes into 1GB chunks (#5213 ) Summary: This is a workaround for the issue described in #5169. It has been tested on a database with very large values, but not dedicated test has been added to the code base. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5213 Differential Revision: D15243116 Pulled By: siying fbshipit-source-id: e0c226a6cd71a60924dcd7ce7af74abcb4054484	2019-05-15 14:20:24 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00
Yanqin Jin	ad27045d14	Update HISTORY after cherrypicking a bug fix to 6.2 (#5309 ) Summary: After cherry-pick a bug fix to 6.2.fb branch, update the HISTORY.md file to reflect this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5309 Differential Revision: D15358002 Pulled By: riversand963 fbshipit-source-id: 5a60510ec6dd444ce5ffaefc69b2e4c38914a921	2019-05-15 13:47:36 -07:00
Yuqi Gu	da7c89d79d	RocksDB Cmake changes for Arm64 CRC32 Optimization (#5304 ) Summary: Add CMake build for RocksDB CRC32 Optimization on Arm64. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5304 Differential Revision: D15355193 Pulled By: miasantreble fbshipit-source-id: 8d750a444274fbde14e510f51290631a369026b8	2019-05-15 13:28:03 -07:00
Thomas Fersch	a42757607d	Use pre-increment instead of post-increment for iterators (#5296 ) Summary: Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296 Differential Revision: D15301256 Pulled By: tfersch fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b	2019-05-15 13:19:15 -07:00
Andres Suarez	189e711b37	Text lint all .gitignore files Reviewed By: scottrice, pallotron Differential Revision: D15353820 fbshipit-source-id: 74f9eaadc90363a958692259f5cb66cef91ac8ef	2019-05-15 11:37:27 -07:00
Maysam Yabandeh	3c3252a06a	Fix tsan complaint in ConcurrentMergeWrite test (#5308 ) Summary: The test was not using separate MemTablePostProcessInfo per memetable insert thread and thus tsan was complaining about data race. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5308 Differential Revision: D15356420 Pulled By: maysamyabandeh fbshipit-source-id: 46c2f2d19fb02c3c775b587aa09ca9c0dae6ed04	2019-05-15 11:21:48 -07:00
anand76	6492430eaf	Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301 ) Summary: This PR has two fixes for crash test failures - 1. Fix a bug in TestMultiGet() in db_stress that was passing list of key to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values 2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time Test - asan_crash and ubsan_crash tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301 Differential Revision: D15337383 Pulled By: anand1976 fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6	2019-05-14 11:58:04 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yi Wu	92c60547fe	db_bench: fix hang on IO error (#5300 ) Summary: db_bench will wait indefinitely if there's background error. Fix by pass `abs_time_us` to cond var. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300 Differential Revision: D15319945 Pulled By: miasantreble fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44	2019-05-13 11:30:35 -07:00
Yanqin Jin	e626016545	Fix a race condition caused by unlocking db mutex (#5294 ) Summary: Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows. Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong. The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used or the memtable list does not have unpicked memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294 Differential Revision: D15295297 Pulled By: riversand963 fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd	2019-05-10 17:56:48 -07:00
Mike Kolupaev	6a6aef25c1	Fix crash in BlockBasedTableIterator::Seek() (#5291 ) Summary: https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true. E.g. if there was an IO error or Status::Incomplete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291 Differential Revision: D15273324 Pulled By: al13n321 fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886	2019-05-10 12:40:57 -07:00
Levi Tamasi	f0bf3bf34b	Turn CachableEntry into a proper resource handle (#5252 ) Summary: CachableEntry is used in a variety of contexts: it may refer to a cached object (i.e. an object in the block cache), an owned object, or an unowned object; also, in some cases (most notably with iterators), the responsibility of managing the pointed-to object gets handed off to another object. Each of the above scenarios have different implications for the lifecycle of the referenced object. For the most part, the patch does not change the lifecycle of managed objects; however, it makes these relationships explicit, and it also enables us to eliminate some hacks and accident-prone code around releasing cache handles and deleting/cleaning up objects. (The only places where the patch changes how an objects are managed are the partitions of partitioned indexes and filters.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252 Differential Revision: D15101358 Pulled By: ltamasi fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be	2019-05-10 11:57:49 -07:00
Jelte Fennema	6451673f37	Add C bindings for LowerThreadPoolIO/CPUPriority (#5285 ) Summary: There were no C bindings for lowering thread pool priority. This adds those. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285 Differential Revision: D15290050 Pulled By: siying fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a	2019-05-09 18:21:21 -07:00
Siying Dong	9fad3e21eb	Merging iterator to avoid child iterator reseek for some cases (#5286 ) Summary: When reseek happens in merging iterator, reseeking a child iterator can be avoided if: (1) the iterator represents imutable data (2) reseek() to a larger key than the current key (3) the current key of the child iterator is larger than the seek key because it is guaranteed that the result will fall into the same position. This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286 Differential Revision: D15283635 Pulled By: siying fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83	2019-05-09 14:20:04 -07:00
anand76	181bb43f08	Fix bugs in FilePickerMultiGet (#5292 ) Summary: This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by - 1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again 2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but its good to handle that case correctly Test - asan_crash make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292 Differential Revision: D15282530 Pulled By: anand1976 fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f	2019-05-09 13:18:00 -07:00
Siying Dong	25d81e4577	DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244 ) Summary: Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number. We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed. In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244 Differential Revision: D15067257 Pulled By: siying fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9	2019-05-09 12:24:04 -07:00
Zhongyi Xie	bdba6c56dd	add WAL replay in TryCatchUpWithPrimary (#5282 ) Summary: Previously in PR https://github.com/facebook/rocksdb/pull/5161 we have added the capability to do WAL tailing in `OpenAsSecondary`, in this PR we extend such feature to `TryCatchUpWithPrimary` which is useful for an secondary RocksDB instance to retrieve and apply the latest updates and refresh log readers if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5282 Differential Revision: D15261011 Pulled By: miasantreble fbshipit-source-id: a15c94471e8c3b3b1f7f47c3135db1126e936949	2019-05-08 10:59:37 -07:00
Zhongyi Xie	eea1cad850	avoid updating index type during iterator creation (#5288 ) Summary: Right now there is a potential race condition where two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads failed to find index_reader from block cache, they will call `CreateIndexReader->UpdateIndexType()` which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realize the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288 Differential Revision: D15252509 Pulled By: miasantreble fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e	2019-05-07 20:20:40 -07:00
anand76	930bfa5750	Disable MultiGet from db_stress (#5284 ) Summary: Disable it for now until we can get stress tests to pass consistently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5284 Differential Revision: D15230727 Pulled By: anand1976 fbshipit-source-id: 239baacdb3c4cd4fb7c4447f7582b9042501d752	2019-05-06 18:26:50 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Zhongyi Xie	5d27d65bef	multiget: fix memory issues due to vector auto resizing (#5279 ) Summary: This PR fixes three memory issues found by ASAN * in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidates references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance. * Similar issue in construction of GetContext autovector in version_set.cc * In multiget_context.h use T[] specialization for unique_ptr that holds a char array Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279 Differential Revision: D15202893 Pulled By: miasantreble fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d	2019-05-03 15:58:43 -07:00
Zhongyi Xie	3e994809a1	fix implicit conversion error reported by clang check (#5277 ) Summary: fix the following clang check errors ``` tools/db_stress.cc:3609:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ tools/db_stress.cc:3888:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ 2 errors generated. make: *** [tools/db_stress.o] Error 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5277 Differential Revision: D15196620 Pulled By: miasantreble fbshipit-source-id: d56b1420d4a9f1df875fc52877a5fbb342bc7cae	2019-05-03 10:02:27 -07:00
Adam Retter	5882e847aa	Allow builds of RocksJava debug releases (#5274 ) Summary: This allows debug releases of RocksJava to be build with the Docker release targets. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5274 Differential Revision: D15185067 Pulled By: sagar0 fbshipit-source-id: f3988e472f281f5844d9a07098344a827b1e7eb1	2019-05-02 14:27:20 -07:00
anand76	434ccf2df4	Add option to use MultiGet in db_stress (#5264 ) Summary: The new option will pick a batch size randomly in the range 1-64. It will then space the keys in the batch by random intervals. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5264 Differential Revision: D15175522 Pulled By: anand1976 fbshipit-source-id: c16baa69d0f1ff4cf53c55c813ddd82c8aeb58fc	2019-05-01 23:06:56 -07:00
Zhongyi Xie	d51eb0b583	set snappy compression only when supported (#4325 ) Summary: Right now `OptimizeLevelStyleCompaction` may set compression type to Snappy even when Snappy is not supported, this may cause errors like "no snappy compression support" Fixes https://github.com/facebook/rocksdb/issues/4283 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4325 Differential Revision: D15125542 Pulled By: miasantreble fbshipit-source-id: 70890b73ababe16752721555dbd290633c2aafac	2019-05-01 20:40:00 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00

... 5 6 7 8 9 ...

8300 Commits