rocksdb

Author	SHA1	Message	Date
haoyuhuang	227b5d52df	Make RocksDB secondary instance respect atomic groups in version edits. (#5411 ) Summary: With this commit, RocksDB secondary instance respects atomic groups in version edits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411 Differential Revision: D15617512 Pulled By: HaoyuHuang fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120	2019-06-04 10:56:19 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong there or are just loosely related. Move them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ is meant for lower-level libraries, so it's a good idea to move out the files that require knowledge of the DB. Create a file/ directory and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
haoyuhuang	518cd1a62a	Use GetCurrentManifestPath to locate current MANIFEST file (#5331 ) Summary: In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331 Differential Revision: D15444524 Pulled By: HaoyuHuang fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903	2019-05-22 09:21:56 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
yiwu-arbug	f3a7847598	Reduce iterator key comparison for upper/lower bound check (#5111 ) Summary: Previously, if an iterator upper/lower bound was present, `DBIter` checked the bound for every key. This patch turns the check into a per-file or per-data-block check when applicable, by checking against either the file's largest/smallest key or the block index key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111 Differential Revision: D15330061 Pulled By: siying fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024	2019-05-17 10:28:31 -07:00
anand76	6492430eaf	Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301 ) Summary: This PR has two fixes for crash test failures - 1. Fix a bug in TestMultiGet() in db_stress that was passing the list of keys to MultiGet() in the wrong order, so that actual values did not match expected values 2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time Test - asan_crash and ubsan_crash tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301 Differential Revision: D15337383 Pulled By: anand1976 fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6	2019-05-14 11:58:04 -07:00
Siying Dong	9fad3e21eb	Merging iterator to avoid child iterator reseek for some cases (#5286 ) Summary: When reseek happens in merging iterator, reseeking a child iterator can be avoided if: (1) the iterator represents immutable data (2) reseek() to a larger key than the current key (3) the current key of the child iterator is larger than the seek key because it is guaranteed that the result will fall into the same position. This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286 Differential Revision: D15283635 Pulled By: siying fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83	2019-05-09 14:20:04 -07:00
anand76	181bb43f08	Fix bugs in FilePickerMultiGet (#5292 ) Summary: This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by - 1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again 2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but it's good to handle that case correctly Test - asan_crash make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292 Differential Revision: D15282530 Pulled By: anand1976 fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f	2019-05-09 13:18:00 -07:00
Zhongyi Xie	5d27d65bef	multiget: fix memory issues due to vector auto resizing (#5279 ) Summary: This PR fixes three memory issues found by ASAN * in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidate references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance. * Similar issue in construction of GetContext autovector in version_set.cc * In multiget_context.h use T[] specialization for unique_ptr that holds a char array Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279 Differential Revision: D15202893 Pulled By: miasantreble fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d	2019-05-03 15:58:43 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00
qinzuoyan	a7d103198e	Print smallest and largest seqno in Version::DebugString() for more details (#5231 ) Summary: In some cases, we want to know the smallest and largest sequence numbers of sstable files, to help us get more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5231 Differential Revision: D15038087 Pulled By: siying fbshipit-source-id: c473c1ca07b53efe2f1884fa1ecdc8686f455ed8	2019-04-23 11:22:02 -07:00
Sagar Vemuri	efa948741c	Use creation_time or mtime when file_creation_time=0 (#5184 ) Summary: We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166). To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0. Here is how the file's modification time (and, in turn, the file age) is computed now: 1. Use `file_creation_time` table property if it is > 0. 2. If not, then use `creation_time` table property if it is > 0. 3. If not, then use file's mtime stat metadata given by the underlying Env. Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184 Differential Revision: D14907795 Pulled By: sagar0 fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c	2019-04-18 22:39:34 -07:00
Siying Dong	992dfc7811	Introduce InternalIteratorBase::NextAndGetResult() (#5197 ) Summary: In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU. Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197 Differential Revision: D14945977 Pulled By: siying fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88	2019-04-18 11:12:39 -07:00
Yanqin Jin	392f6d49e5	Fix a bug in GetOverlappingInputsRangeBinarySearch (#5211 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5211 Differential Revision: D14992018 Pulled By: riversand963 fbshipit-source-id: b5720ea4742029e2fb47ff6d9f8d9de006db4ed4	2019-04-18 09:22:16 -07:00
JiYou	5b7e09bd6f	VersionSet: optimize GetOverlappingInputsRangeBinarySearch (#4987 ) Summary: `GetOverlappingInputsRangeBinarySearch` first uses binary search to find an index in the given range `[begin, end]`. But after finding the index, it uses linear search to find the `start_index` and `end_index`, so the search degrades to linear time. This change optimizes the search as follows: - use `std::lower_bound` and `std::upper_bound` to get `lg(n)` search complexity. - use a uniform lambda for the search process. - simplify the process for `within_interval` true or false. - remove the functions `ExtendFileRangeWithinInterval` and `ExtendFileRangeOverlappingInterval`. Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987 Differential Revision: D14984192 Pulled By: riversand963 fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89	2019-04-17 18:15:20 -07:00
anand76	29111e92b4	Add bounds check in FilePickerMultiGet::PrepareNextLevel() (#5189 ) Summary: Add bounds check when looping through empty levels in FilePickerMultiGet Pull Request resolved: https://github.com/facebook/rocksdb/pull/5189 Differential Revision: D14925334 Pulled By: anand1976 fbshipit-source-id: 65d53247cf443153e28ce2b8b753fa51c6ae4566	2019-04-12 18:05:09 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Sagar Vemuri	d3d20dcdca	Periodic Compactions (#5166 ) Summary: Introducing Periodic Compactions. This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold. - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF. - This works across all levels. - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used). - Compaction filters, if any, are invoked as usual. - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS). This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166 Differential Revision: D14884441 Pulled By: sagar0 fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47	2019-04-10 19:31:18 -07:00
Adam Simpkins	c06c4c01c5	Fix many bugs in log statement arguments (#5089 ) Summary: Annotate all of the logging functions to inform the compiler that these use printf-style formatting arguments. This allows the compiler to emit warnings if the format arguments are incorrect. This also fixes many problems reported now that format string checking is enabled. Many of these are simply mix-ups in the argument type (e.g, int vs uint64_t), but in several cases the wrong number of arguments were being passed in which can cause the code to crash. The primary motivation for this was to fix the log message in `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s format parameter with no argument supplied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089 Differential Revision: D14574795 Pulled By: simpkins fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a	2019-04-04 12:12:11 -07:00
Yi Wu	d69241586e	Fix perf_context.user_key_comparison_count for range scan (#5098 ) Summary: Currently `perf_context.user_key_comparison_count` is bumped only in `InternalKeyComparator`. In places where the user comparator is used directly, the counter is not bumped. This patch fixes the majority of those. Index iterator and filter code also use the user comparator directly and don't bump the counter. That is not fixed in this patch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098 Differential Revision: D14603753 Pulled By: siying fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae	2019-03-27 10:34:27 -07:00
Yanqin Jin	9358178edc	Support for single-primary, multi-secondary instances (#4899 ) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886	2019-03-26 16:45:31 -07:00
Michael Liu	3c5d1b16b1	Apply modernize-use-override (3) Summary: Use C++11’s override and remove virtual where applicable. Changes are automatically generated. bypass-lint drop-conflicts Reviewed By: igorsugak Differential Revision: D14131816 fbshipit-source-id: f20e7f7cecf2e699d70f5fa036f72c0e3f59b50e	2019-02-19 13:39:49 -08:00
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	2019-02-15 09:51:41 -08:00
Yanqin Jin	a69d4deefb	Atomic ingest (#4895 ) Summary: Make file ingestion atomic. as title. Ingesting external SST files into multiple column families should be atomic. If a crash occurs and db reopens, either all column families have successfully ingested the files before the crash, or none of the ingestions have any effect on the state of the db. Also add unit tests for atomic ingestion. Note that the unit test here does not cover the case of incomplete atomic group in the MANIFEST, which is covered in VersionSetTest already. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895 Differential Revision: D13718245 Pulled By: riversand963 fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198	2019-02-12 19:16:17 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Siying Dong	5bf941966b	CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907 ) Summary: Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio to boost files with lots of tombstones too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907 Differential Revision: D13788774 Pulled By: siying fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3	2019-01-23 13:21:01 -08:00
Siying Dong	8641e9adf7	Non-initial file preloading should always prefetch index and filter (#4852 ) Summary: https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1. It doesn't preload index and filter in non-initial file loading case. This is a little bit too complicated to understand. We observed in one MyRocks use case where the filter is expected to be preloaded but is not. To simplify the use case, we simply always prefetch the index and filter. They are expected to be loaded in the file verification phase anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852 Differential Revision: D13595402 Pulled By: siying fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31	2019-01-08 12:47:34 -08:00
Andrew Kryczka	9e2c804fe6	Fix point lookup on range tombstone sentinel endpoint (#4829 ) Summary: Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this: - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1]. - L1's file2 gets compacted to L2. - User issues `Get()` for b#3. - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1. - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`. The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829 Differential Revision: D13561355 Pulled By: ajkr fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f	2019-01-04 11:24:08 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Abhishek Madan	81b6b09f6b	Remove v1 RangeDelAggregator (#4778 ) Summary: Now that v2 is fully functional, the v1 aggregator is removed. The v2 aggregator has been renamed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778 Differential Revision: D13495930 Pulled By: abhimadan fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6	2018-12-17 17:33:46 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Yanqin Jin	4fce44fc8b	Improve flushing multiple column families (#4708 ) Summary: If one column family is dropped, we should simply skip it and continue to flush other active ones. Currently we use Status::ShutdownInProgress to notify caller of column families being dropped. In the future, we should consider using a different Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708 Differential Revision: D13378954 Pulled By: riversand963 fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca	2018-12-13 15:12:40 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Abhishek Madan	457f77b9ff	Introduce RangeDelAggregatorV2 (#4649 ) Summary: The old RangeDelAggregator did expensive pre-processing work to create a collapsed, binary-searchable representation of range tombstones. With FragmentedRangeTombstoneIterator, much of this work is now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking in each iterator to find a covering tombstone in ShouldDelete, while doing minimal work in AddTombstones. The old RangeDelAggregator is still used during flush/compaction for now, though RangeDelAggregatorV2 will support those uses in a future PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649 Differential Revision: D13146964 Pulled By: abhimadan fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3	2018-11-21 10:56:45 -08:00
Abhishek Madan	ed5aec5ba3	Fix range tombstone covering short-circuit logic (#4698 ) Summary: Since a range tombstone seen at one level will cover all keys in the range at lower levels, there was a short-circuiting check in Get that reported a key was not found at most one file after the range tombstone was discovered. However, this was incorrect for merge operands, since a deletion might only cover some merge operands, which implies that the key should be found. This PR fixes this logic in the Version portion of Get, and removes the logic from the MemTable portion of Get, since the performance benefit provided there is minimal. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698 Differential Revision: D13142484 Pulled By: abhimadan fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10	2018-11-20 13:29:22 -08:00
Andrew Kryczka	9d6d4867ab	Fix uninitialized fields in file metadata (#4693 ) Summary: This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693 Differential Revision: D13113189 Pulled By: ajkr fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d	2018-11-16 20:49:17 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy. It is normal for external cache libraries to couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper of the library implementing both the `Cache` and `MemoryAllocator` interfaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
QingpingWang	4f0fcb78ae	Expose num entries and deletions of sst files (#4623 ) Summary: The ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level. Also refer to #3980 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623 Differential Revision: D13045744 Pulled By: sagar0 fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3	2018-11-13 11:52:19 -08:00
Zhongyi Xie	b313019326	use per-level perfcontext for DB::Get calls (#4617 ) Summary: this PR adds two more per-level perf context counters to track * number of keys returned in Get call, break down by levels * total processing time at each level during Get call Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617 Differential Revision: D12898024 Pulled By: miasantreble fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4	2018-11-13 10:40:49 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Yanqin Jin	d1118f6f19	Add test to check if DB can handle atomic group (#4433 ) Summary: Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has a valid atomic group, an incomplete trailing atomic group, an atomic group mixed with normal version edits, and an atomic group with an incorrect size. With this capability, RocksDB identifies invalid groups of version edits and does not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433 Differential Revision: D10079202 Pulled By: riversand963 fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876	2018-10-30 16:37:47 -07:00
Abhishek Madan	eaaf1a6f05	Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594 ) Summary: Since the number of range deletions is reported in TableProperties, it is confusing to not report the number of merge operands and point deletions as top-level properties; they are accessible through the public API, but since they are not the "main" properties, they do not appear in aggregated table properties, or the string representation of table properties. This change promotes those two property keys to `rocksdb/table_properties.h`, adds corresponding uint64 members for them, deprecates the old access methods `GetDeletedKeys()` and `GetMergeOperands()` (though they are still usable for now), and removes `InternalKeyPropertiesCollector`. The property key strings are the same as before this change, so this should be able to read DBs written from older versions (though I haven't tested this yet). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594 Differential Revision: D12826893 Pulled By: abhimadan fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68	2018-10-30 15:34:27 -07:00
Abhishek Madan	7528130e38	Cache fragmented range tombstones in BlockBasedTableReader (#4493 ) Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following: ``` readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found) ``` Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones?       | avg micros/op | stddev micros/op | avg ops/s    | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              | 0.6186        | 0.04637          | 1,625,252.90 | 124,679.41
500 Expanded      | 0.6019        | 0.03628          | 1,666,670.40 | 101,142.65
500 Unexpanded    | 0.6435        | 0.03994          | 1,559,979.40 | 104,090.52
1k Expanded       | 0.6034        | 0.04349          | 1,665,128.10 | 125,144.57
1k Unexpanded     | 0.6261        | 0.03093          | 1,600,457.50 | 79,024.94
5k Expanded       | 0.6163        | 0.05926          | 1,636,668.80 | 154,888.85
5k Unexpanded     | 0.6402        | 0.04002          | 1,567,804.70 | 100,965.55
10k Expanded      | 0.6036        | 0.05105          | 1,667,237.70 | 142,830.36
10k Unexpanded    | 0.6128        | 0.02598          | 1,634,633.40 | 72,161.82
25k Expanded      | 0.6198        | 0.04542          | 1,620,980.50 | 116,662.93
25k Unexpanded    | 0.5478        | 0.0362           | 1,833,059.10 | 121,233.81
50k Expanded      | 0.5104        | 0.04347          | 1,973,107.90 | 184,073.49
50k Unexpanded    | 0.4528        | 0.03387          | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493 Differential Revision: D10842844 Pulled By: abhimadan fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509	2018-10-25 19:26:44 -07:00
Abhishek Madan	8c78348c77	Use only "local" range tombstones during Get (#4449 ) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). 
---- Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d	2018-10-24 12:31:12 -07:00
Maysam Yabandeh	c34cc40424	Fix user comparator receiving internal key (#4575 ) Summary: There was a bug in which the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting a user key but receiving an internal key when called in GenerateBottommostFiles. The patch augments an existing unit test to reproduce the bug and fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575 Differential Revision: D10500434 Pulled By: maysamyabandeh fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366	2018-10-23 08:14:46 -07:00
Siying Dong	7024263682	Dynamic level to adjust level multiplier when write is too heavy (#4338 ) Summary: Level compaction usually performs poorly when writes are so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in write-heavy cases, the level multiplier can be slightly adjusted based on the size of L0. We keep the behavior the same if the number of L0 files is under 2X the compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless writes are so heavy that compaction cannot keep up, the behavior doesn't change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338 Differential Revision: D9636782 Pulled By: siying fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595	2018-10-22 10:21:47 -07:00

612 Commits