rocksdb

Author	SHA1	Message	Date
Yi Wu	662ce62044	Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468 ) Summary: This is a second attempt for https://github.com/facebook/rocksdb/issues/5111, with the fix to redo iterate bounds check after `SeekXXX()`. This is because MyRocks may change iterate bounds between seek. See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion. Closes https://github.com/facebook/rocksdb/issues/5463. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468 Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`. Differential Revision: D15863332 fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18	2019-07-02 11:48:46 -07:00
anand76	7259e28d91	MultiGet parallel IO (#5464 ) Summary: Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464 Test Plan: 1. make check 2. make asan_check 3. make asan_crash Differential Revision: D15911771 Pulled By: anand1976 fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394	2019-06-30 20:56:04 -07:00
sdong	15fd3be07b	LRU Cache to enable mid-point insertion by default (#5508 ) Summary: Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508 Test Plan: Run all existing tests. Differential Revision: D16021179 fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb	2019-06-27 10:20:57 -07:00
Yanqin Jin	c08c0ae731	Add C binding for secondary instance (#5505 ) Summary: Add C binding for secondary instance as well as unit test. Test plan (on devserver) ``` $make clean && COMPILE_WITH_ASAN=1 make -j20 all $./c_test $make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505 Differential Revision: D16000043 Pulled By: riversand963 fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd	2019-06-27 08:58:54 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Yi Wu	2730fe693e	Fix ingested file and direcotry not being sync (#5435 ) Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back. Fixes https://github.com/facebook/rocksdb/issues/5287. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78	2019-06-21 10:15:38 -07:00
Simon Grätzer	fe90ed7a70	Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474 ) Summary: When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail. Fixes https://github.com/facebook/rocksdb/issues/5455 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474 Differential Revision: D15898953 Pulled By: maysamyabandeh fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc	2019-06-19 08:10:08 -07:00
Levi Tamasi	5355e527d9	Make the 'block read count' performance counters consistent (#5484 ) Summary: The patch brings the semantics of per-block-type read performance context counters in sync with the generic block_read_count by only incrementing the counter if the block was actually read from the file. It also fixes index_block_read_count, which fell victim to the refactoring in PR https://github.com/facebook/rocksdb/issues/5298. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484 Test Plan: Extended the unit tests. Differential Revision: D15887431 Pulled By: ltamasi fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6	2019-06-18 19:03:24 -07:00
Yanqin Jin	f287f8dc93	Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472 ) Summary: While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows. The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files again to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them. Test plan (on dev server) ``` $make clean && make -j32 all $./db_secondary_test ``` All existing unit tests must pass as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472 Differential Revision: D15866771 Pulled By: riversand963 fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed	2019-06-18 11:21:37 -07:00
Sagar Vemuri	f1219644ec	Validate CF Options when creating a new column family (#5453 ) Summary: It seems like CF Options are not properly validated when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`). Test Plan: Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions` ``` TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions ``` Also ran gtest-parallel to make sure the new test is not flaky. ``` TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000 [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453 Differential Revision: D15816851 Pulled By: sagar0 fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5	2019-06-14 14:11:10 -07:00
Levi Tamasi	ba64a4cf52	Revert "Reduce iterator key comparison for upper/lower bound check (#5111 )" (#5440 ) Summary: This reverts commit `f3a7847598`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440 Differential Revision: D15765967 Pulled By: ltamasi fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0	2019-06-11 16:23:41 -07:00
Yanqin Jin	6ce5580882	Improve memtable earliest seqno assignment for secondary instance (#5413 ) Summary: In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay. When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families. Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables. 1. Primary triggers a flush and flushes "cf1". "default" is not flushed. 2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST. 3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413 Differential Revision: D15637407 Pulled By: riversand963 fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580	2019-06-10 12:58:14 -07:00
Levi Tamasi	bee2f48a66	Refactor the handling of cache related counters and statistics (#5408 ) Summary: The patch cleans up the handling of cache hit/miss/insertion related performance counters, get context counters, and statistics by eliminating some code duplication and factoring out the affected logic into separate methods. In addition, it makes the semantics of cache hit metrics more consistent by changing the code so that accessing a partition of partitioned indexes/filters through a pinned reference no longer counts as a cache hit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408 Differential Revision: D15610883 Pulled By: ltamasi fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1	2019-06-06 11:36:40 -07:00
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
Andrew Kryczka	ebe89ef9d8	Fix merging range tombstone covering put during flush/compaction (#5406 ) Summary: Flush/compaction use `MergeUntil` which has a special code path to handle a merge ending with a non-`Merge` point key. In particular if that key is a `Put` we forgot to check whether it is covered by a range tombstone. If it is covered then we must not include it in the following call to `TimedFullMerge`. Fixes #5392. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5406 Differential Revision: D15611144 Pulled By: sagar0 fbshipit-source-id: ba6a7863ca2d043f591de78fd0c4f4561f0c500e	2019-06-04 10:24:14 -07:00
Siying Dong	cb094e13bb	Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370 ) Summary: Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370 Differential Revision: D15570413 Pulled By: siying fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e	2019-05-31 10:50:19 -07:00
Yanqin Jin	b9f5900658	Fix WAL replay by skipping old write batches (#5170 ) Summary: 1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables. 2. Add support for benchmarking secondary instance to db_bench_tool. With changes made in this PR, we can start benchmarking secondary instance using two processes. It is also possible to vary the frequency at which the secondary instance tries to catch up with the primary. The info log of the secondary can be found in a directory whose path can be specified with '-secondary_path'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170 Differential Revision: D15564608 Pulled By: riversand963 fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8	2019-05-30 19:33:33 -07:00
Levi Tamasi	1e35584251	Move the index readers out of the block cache (#5298 ) Summary: Currently, when the block cache is used for index blocks as well, it is not really the index block that is stored in the cache but an IndexReader object. Since this object is not pure data (it has, for instance, pointers that might dangle), it's not really sharable. To avoid the issues around this, the current code uses a dummy unique cache key for each TableReader to store the IndexReader, and erases the IndexReader entry when the TableReader is closed. Instead of doing this, the new code moves the IndexReader out of the cache altogether. In particular, instead of the TableReader owning, or caching/pinning the IndexReader based on the customer's settings, the TableReader unconditionally owns the IndexReader, which in turn owns/caches/pins the index block (which is itself sharable and thus can be safely put in the cache without any hacks). Note: the change has two side effects: 1) Partitions of partitioned indexes no longer affect the read amplification statistics. 2) Eviction statistics for index blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298 Differential Revision: D15303203 Pulled By: ltamasi fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47	2019-05-30 11:53:27 -07:00
haoyuhuang	74a334a2eb	Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333 ) Summary: RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333 Differential Revision: D15457597 Pulled By: HaoyuHuang fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214	2019-05-23 21:58:52 -07:00
Siying Dong	b2274da0e5	LogWriter to only flush after finish generating whole record (#5328 ) Summary: Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328 Differential Revision: D15425032 Pulled By: siying fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa	2019-05-21 12:33:17 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Yanqin Jin	fb4c6a31ce	Log replay integration for secondary instance (#5305 ) Summary: RocksDB secondary can replay both MANIFEST and WAL now. On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data. This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305 Differential Revision: D15386512 Pulled By: riversand963 fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355	2019-05-17 19:19:51 -07:00
yiwu-arbug	f3a7847598	Reduce iterator key comparison for upper/lower bound check (#5111 ) Summary: Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111 Differential Revision: D15330061 Pulled By: siying fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024	2019-05-17 10:28:31 -07:00
Yanqin Jin	ad27045d14	Update HISTORY after cherrypicking a bug fix to 6.2 (#5309 ) Summary: After cherry-pick a bug fix to 6.2.fb branch, update the HISTORY.md file to reflect this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5309 Differential Revision: D15358002 Pulled By: riversand963 fbshipit-source-id: 5a60510ec6dd444ce5ffaefc69b2e4c38914a921	2019-05-15 13:47:36 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yanqin Jin	e626016545	Fix a race condition caused by unlocking db mutex (#5294 ) Summary: Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows. Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong. The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used or the memtable list does not have unpicked memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294 Differential Revision: D15295297 Pulled By: riversand963 fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd	2019-05-10 17:56:48 -07:00
Siying Dong	9fad3e21eb	Merging iterator to avoid child iterator reseek for some cases (#5286 ) Summary: When reseek happens in merging iterator, reseeking a child iterator can be avoided if: (1) the iterator represents imutable data (2) reseek() to a larger key than the current key (3) the current key of the child iterator is larger than the seek key because it is guaranteed that the result will fall into the same position. This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286 Differential Revision: D15283635 Pulled By: siying fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83	2019-05-09 14:20:04 -07:00
Siying Dong	25d81e4577	DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244 ) Summary: Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number. We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed. In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244 Differential Revision: D15067257 Pulled By: siying fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9	2019-05-09 12:24:04 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00
Siying Dong	4e0f2aadb0	DB::Close() to fail when there are unreleased snapshots (#5272 ) Summary: Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272 Differential Revision: D15159713 Pulled By: siying fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd	2019-05-01 10:17:30 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Fosco Marotto	36ea379cdc	Update history and version for future 6.2.0 (#5270 ) Summary: Update history before branch cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5270 Differential Revision: D15153700 Pulled By: gfosco fbshipit-source-id: 2c81e01a2ab965661b1d88209dca74ba0a3756cb	2019-04-30 15:09:36 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
Andrew Kryczka	6eb317bb4c	Option string/map/file can set env from object registry (#5237 ) Summary: - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object. - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup. - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237 Differential Revision: D15056360 Pulled By: siying fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e	2019-04-25 11:35:09 -07:00
Yanqin Jin	da96f2fe00	Close WAL files before deletion (#5233 ) Summary: Currently one thread in RocksDB keeps a WAL file open while another thread deletes it. Although the first thread never writes to the WAL again, it still tries to close it in the end. This is fine on POSIX, but can be problematic on other platforms, e.g. HDFS, etc.. It will either cause a lot of warning messages or throw exceptions. The solution is to let the second thread close the WAL before deleting it. RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233 Differential Revision: D15032670 Pulled By: riversand963 fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754	2019-04-25 10:11:41 -07:00
Zhongyi Xie	66d8360beb	update history.md (#5245 ) Summary: update history.md for `BottommostLevelCompaction::kForceOptimized` to mention possible user impact. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5245 Differential Revision: D15073712 Pulled By: miasantreble fbshipit-source-id: d40f698c42e8a6368be4eac0a00d02279615edea	2019-04-24 21:30:00 -07:00
Andrew Kryczka	8272a6de57	Optionally wait on bytes_per_sync to smooth I/O (#5183 ) Summary: The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways. My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes. Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE \| SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it. There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically). The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183 Differential Revision: D14953553 Pulled By: siying fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9	2019-04-22 11:51:39 -07:00
Mike Kolupaev	df38c1ce66	Add BlockBasedTableOptions::index_shortening (#5174 ) Summary: Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174 Differential Revision: D14884185 Pulled By: al13n321 fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534	2019-04-22 08:20:35 -07:00
Sagar Vemuri	dc64c2f5cc	Fix history to not include some features in 6.1 (#5224 ) Summary: Fix HISTORY.md by removing a few items from 6.1.1 history as they did not make into the 6.1.fb branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5224 Differential Revision: D15017030 Pulled By: sagar0 fbshipit-source-id: 090724d326d29168952e06dc1a5090c03fdd739e	2019-04-19 13:00:53 -07:00
Zhongyi Xie	3bdce20e2b	reorganize history.md to list unreleased changes separately Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5216 Differential Revision: D15003749 Pulled By: miasantreble fbshipit-source-id: a52c264e694cd7c55813be33ee22b4f3046b545a	2019-04-18 14:55:57 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Siying Dong	beb44ec3eb	WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175 ) Summary: Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget. Change it to 256KB accordingly. There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175 Differential Revision: D14954289 Pulled By: siying fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94	2019-04-16 12:03:07 -07:00
yiwu-arbug	f1239d5f10	Avoid per-key upper bound check in BlockBasedTableIterator (#5142 ) Summary: This is second attempt for #5101. Original commit message: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. This patch come with two fixes: Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter. Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142 Differential Revision: D14907113 Pulled By: siying fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb	2019-04-16 11:37:47 -07:00
Fosco Marotto	b5cad5c986	Update history and version to 6.1.1 (#5171 ) Summary: Including latest fixes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5171 Differential Revision: D14875157 Pulled By: gfosco fbshipit-source-id: 86ec7ee3553a9b25ab71ed98966ce08a16322e2c	2019-04-15 10:49:38 -07:00
yiwu-arbug	cca141ecf8	Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190 ) Summary: Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190 Differential Revision: D14923773 Pulled By: miasantreble fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04	2019-04-12 17:07:49 -07:00
Maysam Yabandeh	fe642cbee6	WritePrepared: fix race condition in reading batch with duplicate keys (#5147 ) Summary: When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete. The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache. A unit test is added to reproduce the race condition with duplicate keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147 Differential Revision: D14758815 Pulled By: maysamyabandeh fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719	2019-04-12 14:40:41 -07:00
Siying Dong	ed9f5e21aa	Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165 ) Summary: Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory. Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165 Differential Revision: D14880709 Pulled By: siying fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d	2019-04-11 10:45:36 -07:00

1 2 3 4 5 ...

569 Commits