rocksdb

Author	SHA1	Message	Date
Siying Dong	1a761e6a6c	Add a placeholder in manifest indicating ignorable record (#4960 ) Summary: We want to reserve the ability to add extra information to the manifest in the future in a way that stays forward compatible with previous versions. Now we create a placeholder for that. A bit in the tag is added to indicate that a field can be safely ignored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960 Differential Revision: D14000484 Pulled By: siying fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6	2019-02-08 11:33:11 -08:00
Siying Dong	f48758e939	Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954 ) Summary: We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't what we have expected. We thought that snapshot will always be preserved. However, we just realized that, if no snapshot is created while compaction starts, and a snapshot is created after that, the data seen from the snapshot can successfully be dropped by the compaction. This creates a strange behavior to the feature, which is hard to explain. Like what is documented in code comment, this feature is not very useful with snapshot anyway. The decision is to deprecate the feature. We keep the function to avoid to break users code. However, we will fail compactions if false is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954 Differential Revision: D13981900 Pulled By: siying fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36	2019-02-07 16:57:33 -08:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Zhongyi Xie	71cae59a99	exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950 ) Summary: This will fix the following build error: > db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’: > db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950 Differential Revision: D13965378 Pulled By: miasantreble fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56	2019-02-05 17:01:11 -08:00
Zhongyi Xie	00ed41daee	Allow copy for PerfContext objects (#4919 ) Summary: Existing implementation of PerfContext does not define a copy constructor or assignment operator, which could potentially cause problems when users create copies and reset the builtin one. This PR addresses the issue by providing a copy constructor and assignment operator with deep copy semantics. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919 Differential Revision: D13960406 Pulled By: miasantreble fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334	2019-02-05 14:29:08 -08:00
Jay Zhuang	c9a52cbdc8	Fix potential DB hang while using CompactFiles (#4940 ) Summary: CompactFiles() may block auto compaction, which could cause a DB hang when it reaches level0_stop_writes_trigger. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940 Differential Revision: D13929648 Pulled By: cooldoger fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381	2019-02-05 11:23:38 -08:00
Siying Dong	8fe073324f	BYTES_READ stats miscount for NotFound cases (#4938 ) Summary: In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes are still increased. The amount increased will be whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not increasing these two counters. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938 Differential Revision: D13908963 Pulled By: siying fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162	2019-02-05 10:53:35 -08:00
yangzhijia	31221bb7e8	Properly set upper bound of subcompaction output (#4879 ) (#4898 ) Summary: Fix the output overlap bug when using subcompactions: the upper bound of the output file was extended incorrectly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898 Differential Revision: D13736107 Pulled By: ajkr fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3	2019-02-05 10:20:16 -08:00
Maysam Yabandeh	30468d8eb4	Fix analyze error on possible un-initialized value (#4937 ) Summary: The patch fixes the following analyze error by checking the return status of ParseInternalKey. ``` db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value assert(kTypeMerge == orig_ikey.type); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937 Differential Revision: D13908506 Pulled By: maysamyabandeh fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6	2019-02-01 09:41:27 -08:00
Ming Zhao	59244447e3	Zero seqnum of final key / drop final tombstone when compacting to bottommost level Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927 Differential Revision: D13889458 Pulled By: mzhaom fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e	2019-02-01 09:21:57 -08:00
Yanqin Jin	842cdc11dd	Use correct FileMeta for atomic flush result install (#4932 ) Summary: 1. this commit fixes our handling of a combination of two separate edge cases. If a flush job does not pick any memtable to flush (because another flush job has already picked the same memtables), and the column family assigned to the flush job is dropped right before RocksDB calls rocksdb::InstallMemtableAtomicFlushResults, our original code passes a FileMetaData object whose file number is 0, failing the assertion in rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)). 2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932 Differential Revision: D13901322 Pulled By: riversand963 fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9	2019-01-31 14:49:51 -08:00
Maysam Yabandeh	35e5689e11	Take snapshots once for all cf flushes (#4934 ) Summary: FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934 Differential Revision: D13900747 Pulled By: maysamyabandeh fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c	2019-01-31 12:21:59 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Yanqin Jin	158da7a6ee	Verify checksum before ingestion (#4916 ) Summary: before file ingestion (in preparation phase), verify the checksums of the blocks of the external SST file, including properties block with global seqno. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916 Differential Revision: D13863501 Pulled By: riversand963 fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee	2019-01-29 17:17:29 -08:00
Sagar Vemuri	4978caaa6f	Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925 ) Summary: While stepping through the code I noticed that there is a redundant call to TableFileName. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925 Differential Revision: D13845749 Pulled By: sagar0 fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d	2019-01-28 13:33:23 -08:00
Siying Dong	ee1818081f	Remove PlainTable's feature store_index_in_file (#4914 ) Summary: Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914 Differential Revision: D13791883 Pulled By: siying fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e	2019-01-28 12:50:22 -08:00
Siying Dong	bc7d1661a8	Fix test name typo in PlainTableDBTest Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926 Differential Revision: D13830196 Pulled By: siying fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021	2019-01-25 18:14:26 -08:00
Siying Dong	f184bee77b	PlainTable should avoid copying Get() results from immortal source. (#4924 ) Summary: https://github.com/facebook/rocksdb/pull/4053 avoids a memory copy for Get() results if files are immortal (read-only DB, max_open_files=-1) and the file is mmapped. The same optimization is being applied to PlainTable here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924 Differential Revision: D13827749 Pulled By: siying fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96	2019-01-25 17:12:19 -08:00
Siying Dong	fc53839bfa	Disallow customized hash function in DynamicBloom (#4915 ) Summary: I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915 Differential Revision: D13794452 Pulled By: siying fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819	2019-01-24 10:34:30 -08:00
Dmitry Fink	e07aa8669d	Allow full merge when root of history for a key is reached (#4909 ) Summary: Previously compaction was not collapsing operands for the first key on a layer, even in cases when it was its root of history. Some tests (CompactionJobTest.NonAssocMerge) were actually accounting for that bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909 Differential Revision: D13781169 Pulled By: finik fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336	2019-01-23 21:46:10 -08:00
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	2019-01-23 18:15:47 -08:00
PeifengSi	43defe9872	Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902 ) Summary: Even if a key falls in a file's range, we cannot infer that it definitely exists in this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902 Differential Revision: D13795018 Pulled By: siying fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914	2019-01-23 18:00:56 -08:00
Siying Dong	d94aa2f7db	Make compaction_pri = kMinOverlappingRatio to be default (#4911 ) Summary: compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default. https://github.com/facebook/rocksdb/pull/4907 fixes one shortcoming of this option. Make it default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911 Differential Revision: D13789262 Pulled By: siying fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269	2019-01-23 16:47:38 -08:00
Siying Dong	5bf941966b	CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907 ) Summary: Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio to boost files with lots of tombstones too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907 Differential Revision: D13788774 Pulled By: siying fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3	2019-01-23 13:21:01 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Yi Wu	b1ad6ebba8	WritePrepared: fix two versions in compaction see different status for released snapshots (#4890 ) Summary: Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios: Scenario 1: key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1. The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key. Scenario 2: key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here. The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890 Differential Revision: D13705538 Pulled By: maysamyabandeh fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4	2019-01-18 17:24:06 -08:00
Yi Wu	128f532858	WritePrepared: fix issue with snapshot released during compaction (#4858 ) Summary: Compaction iterator keeps a copy of the list of live snapshots at the beginning of compaction, and then queries the snapshot checker to verify whether values of a sequence number are visible to these snapshots. However, when a snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info associated with the snapshot and may report an incorrect result, which leads to values being compacted out when they shouldn't be. This patch conservatively keeps the values if the snapshot checker determines that the snapshot is released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858 Differential Revision: D13617146 Pulled By: maysamyabandeh fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3	2019-01-16 09:55:32 -08:00
Yanqin Jin	e79df377c5	Use chrono::time_point instead of time_t (#4868 ) Summary: By convention, time_t almost always stores the integral number of seconds since 00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/. We surely want more precision than seconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868 Differential Revision: D13633046 Pulled By: riversand963 fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396	2019-01-16 09:51:05 -08:00
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	2019-01-15 21:34:38 -08:00
Maysam Yabandeh	cad99a6031	WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886 ) Summary: The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes: i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare. ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max. To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886 Differential Revision: D13685270 Pulled By: maysamyabandeh fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0	2019-01-15 18:11:52 -08:00
Siying Dong	7d13f307ff	Improve Error Message When wal_dir doesn't exist (#4874 ) Summary: Right now the error message when options.wal_dir doesn't exist is not helpful to users. Be more specific. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874 Differential Revision: D13642425 Pulled By: siying fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5	2019-01-15 16:46:04 -08:00
Yanqin Jin	301da345ae	Make a copy of MutableCFOptions to avoid race condition (#4876 ) Summary: If we do not do this, then reading MutableCFOptions may have a race condition with SetOptions which modifies MutableCFOptions. Also reserve space in advance for vectors to avoid reallocation changing the address of its elements. Test plan ``` $make clean && make -j32 all check $make clean && COMPILE_WITH_TSAN=1 make -j32 all check $make clean && COMPILE_WITH_ASAN=1 make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876 Differential Revision: D13644500 Pulled By: riversand963 fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47	2019-01-11 17:43:37 -08:00
Maysam Yabandeh	d56ac22b44	Remove duplicates from SnapshotList::GetAll (#4860 ) Summary: The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entries are of no use and could be safely ignored. Moreover, not having duplicate entries simplifies reasoning in the compaction_iterator.cc code. For example, when searching for the previous_snap we currently use the snapshot before the current one, but the way the code uses it expects it to also be less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860 Differential Revision: D13615502 Pulled By: maysamyabandeh fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b	2019-01-09 16:25:42 -08:00
Siying Dong	8641e9adf7	Non-initial file preloading should always prefetch index and filter (#4852 ) Summary: https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1. It doesn't preload index and filter in the non-initial file loading case. This is a little bit too complicated to understand. We observed one MyRocks use case where the filter is expected to be preloaded but is not. To simplify the use case, we simply always prefetch the index and filter. They are expected to be loaded in the file verification phase anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852 Differential Revision: D13595402 Pulled By: siying fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31	2019-01-08 12:47:34 -08:00
tom wang	42135523a0	modify comments about flush_queue_ Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4850 Differential Revision: D13591940 Pulled By: sagar0 fbshipit-source-id: 617794e0a41d0f4554d40871180b061e84189fc5	2019-01-07 13:52:59 -08:00
Yi Wu	cf852fdf55	Minor fix: single delete a blob value is not a mismatch (#4848 ) Summary: In compaction iterator, if the next value of a single delete is a blob value, it should not be treated as a mismatch. This is only a minor fix and doesn't affect correctness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848 Differential Revision: D13585812 Pulled By: yiwu-arbug fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0	2019-01-04 16:31:02 -08:00
Andrew Kryczka	9e2c804fe6	Fix point lookup on range tombstone sentinel endpoint (#4829 ) Summary: Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this: - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1]. - L1's file2 gets compacted to L2. - User issues `Get()` for b#3. - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1. - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`. The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829 Differential Revision: D13561355 Pulled By: ajkr fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f	2019-01-04 11:24:08 -08:00
Yanqin Jin	a07175af65	Refactor atomic flush result installation to MANIFEST (#4791 ) Summary: as titled. Since different bg flush threads can flush different sets of column families (due to column family creation and drop), we decide not to let one thread perform atomic flush result installation for other threads. Bg flush threads will install their atomic flush results sequentially to MANIFEST, using a conditional variable, i.e. atomic_flush_install_cv_ to coordinate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791 Differential Revision: D13498930 Pulled By: riversand963 fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7	2019-01-03 20:56:24 -08:00
Yi Wu	77a8d4d476	Detect if Jemalloc is linked with the binary (#4844 ) Summary: Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844 Differential Revision: D13574934 Pulled By: yiwu-arbug fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db	2019-01-03 16:30:12 -08:00
DorianZheng	8c79f79208	Fix skip WAL for whole write_group when leader's callback fail (#4838 ) Summary: The original implementation has two problems: 1. `f0dda35d7d/db/db_impl_write.cc (L478)` `f0dda35d7d/db/write_thread.h (L231)` If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss. 2. `f0dda35d7d/db/write_thread.h (L130)` The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838 Differential Revision: D13574070 Pulled By: yiwu-arbug fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b	2019-01-03 12:40:42 -08:00
Siying Dong	e4feb78606	Try to fix DBSSTTest.RateLimitedDelete flakiness (#4840 ) Summary: DBSSTTest.RateLimitedDelete is flaky. The root cause is not completely identified, but the compaction waiting in the test doesn't strictly wait for compaction cleaning to finish, which may cause test flakiness. Fix it first and see whether the failures still happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840 Differential Revision: D13567273 Pulled By: siying fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a	2019-01-03 11:05:19 -08:00
Andrew Kryczka	ace543a815	fix accounting for range tombstones in TableProperties (#4841 ) Summary: - To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`. - Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841 Differential Revision: D13568424 Pulled By: ajkr fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026	2019-01-02 15:08:53 -08:00
Anand Ananthabhotla	b9d6eccac1	Lock free MultiGet (#4754 ) Summary: Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex. Tests: 1. Unit tests 2. make check 3. db_bench - TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1 readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found) Multireadrandom with batch size 1: multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754 Differential Revision: D13363550 Pulled By: anand1976 fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4	2019-01-02 11:42:54 -08:00
Faustin Lammler	7d65bd5ce4	Fix spelling errors (#4827 ) Summary: Hi, Lintian, the Debian package checker complains about spelling error (spelling-error-in-binary). See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827 Differential Revision: D13566362 Pulled By: riversand963 fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d	2019-01-02 11:17:57 -08:00
Yanqin Jin	ec68091d19	Remove an unused parameter (#4816 ) Summary: The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is not used. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816 Differential Revision: D13543218 Pulled By: riversand963 fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153	2019-01-02 09:59:13 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the performance gap between options.max_open_files being -1 and being a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1, limit the files to preload at DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Burton Li	46e3209e0d	Compaction limiter miscs (#4795 ) Summary: 1. Remove unused API SubtractCompactionTask(). 2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor. 3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() isn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795 Differential Revision: D13542183 Pulled By: siying fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33	2018-12-26 13:59:35 -08:00
Alexander Zinoviev	80bf8975fd	Add a new per level counter for block cache hit (#4796 ) Summary: Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796 Differential Revision: D13513688 Pulled By: zinoale fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281	2018-12-21 13:20:05 -08:00
Andrew Kryczka	e0be1bc4f1	fix DeleteRange memory leak for mmap and block cache (#4810 ) Summary: Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`. Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810 Differential Revision: D13534296 Pulled By: ajkr fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599	2018-12-20 21:59:49 -08:00
Siying Dong	da1c64b6e7	Introduce a CPU time counter in perf_context (#4741 ) Summary: Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future. Only Posix Env has it implemented, using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform. Make PerfStepTimer take an Env as an argument, and sometimes pass it in. The direct reason is to let the unit tests use SpecialEnv, where we can inject logic. But in the long term, this is a good change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741 Differential Revision: D13287798 Pulled By: siying fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f	2018-12-20 12:03:44 -08:00
Abhishek Madan	02bfc5831e	Change is_range_del_table_empty_ flag to atomic (#4801 ) Summary: To avoid a race on the flag, make it an atomic_bool. This doesn't seem to significantly affect benchmarks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4801 Differential Revision: D13523845 Pulled By: abhimadan fbshipit-source-id: 3bc29f53c50a4e06cd9f8c6232a4bb221868e055	2018-12-19 17:21:14 -08:00
Abhishek Madan	8bf73208a4	Remove stale TODO (#4800 ) Summary: This TODO was already addressed, but I forgot to remove it before landing the PR it came from. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4800 Differential Revision: D13522284 Pulled By: abhimadan fbshipit-source-id: 7766bc4f5b54e47d355cf26137ef5e86c604472a	2018-12-19 15:45:37 -08:00
Jakub Tomanik	71a69d9b68	Fix building RocksDB for iOS (#4687 ) Summary: This PR contains the following fixes: 1. Fixing Makefile to support non-default locations of developer tools 2. Fixing compile error using a patch from https://github.com/facebook/rocksdb/pull/4007 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4687 Differential Revision: D13287263 Pulled By: riversand963 fbshipit-source-id: 4525eb42ba7b6f82af5f9bfb8e52fa4024e27ccc	2018-12-19 14:13:55 -08:00
Adam Retter	1b0c9ce396	Fix Windows broken build error due to non-const override (#4798 ) Summary: 1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`. The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows. 2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798 Differential Revision: D13519734 Pulled By: sagar0 fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459	2018-12-19 13:29:51 -08:00
Yanqin Jin	671a7eb36f	Avoid switching empty memtable in certain cases (#4792 ) Summary: In certain cases, we do not perform memtable switching if the active memtable of the column family is empty. Two exceptions: 1. In manual flush, if cached_recoverable_state_empty_ is false, then we need to switch memtable due to a requirement of transactions. 2. In switch WAL, we need to switch memtable anyway because we have to seal the memtable if the WAL on which it depends will be closed. This change can potentially delay the occurrence of write stalls because the number of memtables increases more slowly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792 Differential Revision: D13499501 Pulled By: riversand963 fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1	2018-12-18 16:47:23 -08:00
Abhishek Madan	c15df15f07	Fix unused member compile error Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4793 Differential Revision: D13509363 Pulled By: abhimadan fbshipit-source-id: 530b4765e3335d6ecd016bfaa89645f8aa98c61f	2018-12-18 14:28:42 -08:00
Abhishek Madan	81b6b09f6b	Remove v1 RangeDelAggregator (#4778 ) Summary: Now that v2 is fully functional, the v1 aggregator is removed. The v2 aggregator has been renamed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778 Differential Revision: D13495930 Pulled By: abhimadan fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6	2018-12-17 17:33:46 -08:00
Roman Zeyde	a62c6626e0	Support setting options on column families via C bindings (#4785 ) Summary: Currently, it supports setting options only on the default column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4785 Differential Revision: D13491819 Pulled By: ajkr fbshipit-source-id: 75c78bd86222bb05568e538562af84fb53eb4d8d	2018-12-17 13:52:12 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Maysam Yabandeh	4ed3c1eb88	Fix flaky test DeleteFileRange (#4784 ) Summary: The test fails sporadically because it expects the DB to be empty after the DeleteFilesInRange(..., nullptr, nullptr) call, which it is not. Debugging shows cases where the files are skipped since they are being compacted. The patch fixes the test by waiting for the last CompactRange to finish before calling DeleteFilesInRange. Verified by ``` ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=10000 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4784 Differential Revision: D13469402 Pulled By: maysamyabandeh fbshipit-source-id: 3d8f44abe205b82c69f01e7edf27e1f8098248e1	2018-12-14 13:47:36 -08:00
Maysam Yabandeh	349542332a	Fix race condition on options_file_number_ (#4780 ) Summary: options_file_number_ must be written under db::mutex_ since its read is protected by mutex_ in ::GetLiveFiles(). However, currently it is written in ::RenameTempFileToOptionsFile(), which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by also acquiring the mutex_ before writing options_file_number_. It also does that only if the rename of the options file is successful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780 Differential Revision: D13461411 Pulled By: maysamyabandeh fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b	2018-12-13 19:27:38 -08:00
Yanqin Jin	4fce44fc8b	Improve flushing multiple column families (#4708 ) Summary: If one column family is dropped, we should simply skip it and continue to flush other active ones. Currently we use Status::ShutdownInProgress to notify caller of column families being dropped. In the future, we should consider using a different Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708 Differential Revision: D13378954 Pulled By: riversand963 fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca	2018-12-13 15:12:40 -08:00
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	2018-12-13 14:21:24 -08:00
Burton Li	a8b9891f95	Concurrent task limiter for compaction thread control (#4332 ) Summary: This PR aims to resolve the issue https://github.com/facebook/rocksdb/issues/3972#issue-330771918 We have a RocksDB instance created with leveled compaction and multiple column families (CFs). Some CFs use HDD to store big, less frequently accessed data, and others use SSD. When there is continuous write traffic to all CFs, the compaction thread pool is mostly occupied by the slow HDD compactions, which prevents full utilization of the SSD bandwidth. Since atomic writes and transactions are needed across CFs, splitting into multiple RocksDB instances is not an option for us. With compaction thread control, we got a 30%+ HDD write throughput gain, and also much smoother SSD writes since fewer write stalls happen. ConcurrentTaskLimiter can be shared by multiple CFs across RocksDB instances, so the feature works not only for multi-CF scenarios but also for multi-RocksDB scenarios that need disk IO resource control per tenant. The usage is straightforward, e.g.:
```
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332 Differential Revision: D13226590 Pulled By: siying fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d	2018-12-13 13:18:28 -08:00
Maysam Yabandeh	0aa17c1002	Fix flaky test DBCompactionTest::DeleteFileRange (#4776 ) Summary: The test has been failing sporadically probably because the configured compaction options were actually unused. Verified that by the following: ``` ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=1000 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4776 Differential Revision: D13441052 Pulled By: maysamyabandeh fbshipit-source-id: d35075b9e6cef9b9c9d0d571f9cd72ade8eda55d	2018-12-12 16:32:14 -08:00
DorianZheng	4862720e08	Expose column family id to `FlushJobInfo` Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772 Differential Revision: D13428923 Pulled By: ajkr fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a	2018-12-11 20:33:42 -08:00
Abhishek Madan	cad248f5c6	Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740 ) Summary: To support the flush/compaction use cases of RangeDelAggregator in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones that cannot be read in the compaction output file. Furthermore, FragmentedRangeTombstoneIterator supports the "snapshot striping" use case by allowing an iterator to be split by a list of snapshots. RangeDelAggregatorV2 will use these changes in a follow-up change. In the process of making these changes, other miscellaneous cleanups were also done in these files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740 Differential Revision: D13287382 Pulled By: abhimadan fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c	2018-12-11 12:10:48 -08:00
Sagar Vemuri	dde3ef1116	Change directory where ExternalSSTFileBasicTest runs (#4766 ) Summary: Change the directory where ExternalSSTFileBasicTest* tests run. Problem: Without this change, I spent considerable time chasing around a non-existent issue as ExternalSSTFileTest.* and ExternalSSTFileBasicTest.* create similar directories. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4766 Differential Revision: D13409384 Pulled By: sagar0 fbshipit-source-id: c33e1f4d505dfa6efbc788d6c57cdb680053ded3	2018-12-11 10:21:37 -08:00
Abhishek Madan	64aabc9183	Properly set smallest key of subcompaction output (#4723 ) Summary: It is possible to see a situation like the following when subcompactions are enabled: 1. A subcompaction boundary is set to `[b, e)`. 2. The first output file in a subcompaction has `c@20` as its smallest key 3. The range tombstone `[a, d)30` is encountered. 4. The tombstone is written to the range-del meta block and the new smallest key is set to `b@0` (since no keys in this subcompaction's output can be smaller than `b`). 5. A key `b@10` in a lower level will now reappear, since it is not covered by the truncated start key `b@0`. In general, unless the smallest data key in a file has a seqnum of 0, it is not safe to truncate a tombstone at the start key to have a seqnum of 0, since it can expose keys with a seqnum greater than 0 but less than the tombstone's actual seqnum. To fix this, when the lower bound of a file is from the subcompaction boundaries, we now set the seqnum of an artificially extended smallest key to the tombstone's seqnum. This is safe because subcompactions operate over disjoint sets of keys, and the subcompactions that can experience this problem are not the first subcompaction (which is unbounded on the left). Furthermore, there is now an assertion to detect the described anomalous case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4723 Differential Revision: D13236188 Pulled By: abhimadan fbshipit-source-id: a6da6a113f2de1e2ff307ca72e055300c8fe5692	2018-12-10 12:38:31 -08:00
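A minimal sketch of the ordering argument in #4723, assuming the usual internal-key order (user key ascending, then sequence number descending). The names `InternalKey`, `InternalLess`, and `TruncatedTombstoneCovers` are hypothetical and are not RocksDB APIs, and the coverage check is deliberately simplified.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an internal key: user key plus sequence number.
struct InternalKey {
  std::string user_key;
  uint64_t seq;
};

// Assumed ordering: user key ascending, then sequence number DESCENDING,
// so for the same user key a newer entry sorts first.
bool InternalLess(const InternalKey& a, const InternalKey& b) {
  if (a.user_key != b.user_key) return a.user_key < b.user_key;
  return a.seq > b.seq;
}

// Simplified model: a tombstone truncated at a file's smallest key covers a
// point key only if the key does not sort before that smallest key, lies
// before the tombstone's end key, and is older than the tombstone.
bool TruncatedTombstoneCovers(const InternalKey& file_smallest,
                              const std::string& tombstone_end,
                              uint64_t tombstone_seq,
                              const InternalKey& key) {
  assert(tombstone_seq > 0);
  return !InternalLess(key, file_smallest) &&
         key.user_key < tombstone_end && key.seq < tombstone_seq;
}
```

For the tombstone `[a, d)` at seqnum 30, a smallest key of `b@0` leaves `b@10` sorted before the file's lower bound, so it escapes the truncated tombstone; a smallest key of `b@30` keeps `b@10` covered.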
Yanqin Jin	f307479ba6	Enable checkpoint of read-only db (#4681 ) Summary: 1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it should call DBImpl::GetLiveFiles(flush_memtable=false). 2. In DBImpl::Recover, we should also recover the OPTIONS file name and/or number so that an immediate subsequent GetLiveFiles will get the correct OPTIONS name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681 Differential Revision: D13069205 Pulled By: riversand963 fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a	2018-12-07 17:06:02 -08:00
Yanqin Jin	9be3e6b488	Allow file-ingest-triggered flush to skip waiting for write-stall clear (#4751 ) Summary: When a write stall has already been triggered due to the number of L0 files reaching the threshold, file ingestion must proceed with its flush without waiting for the write stall condition to be cleared by the compaction, because compaction can wait for ingestion to finish (circular wait). In order to avoid this wait, we can set `FlushOptions.allow_write_stall` to be true (default is false). Setting it to false can cause deadlock. This can happen when the number of compaction threads is low. Consider the following ``` Time compaction_thread ingestion_thread \| num_running_ingest_file_++ \| while(num_running_ingest_file_>0){wait} \| flush V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4751 Differential Revision: D13343037 Pulled By: riversand963 fbshipit-source-id: d3b95938814af46ec4c463feff0b50c70bd8b23f	2018-12-05 14:59:29 -08:00
Yanqin Jin	b96fccb1e6	Move a function to critical section (#4752 ) Summary: Test plan ``` $make clean && make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4752 Differential Revision: D13344705 Pulled By: riversand963 fbshipit-source-id: fc3a43174d09d70ccc2b09decd78e1da1b6ba9d1	2018-12-05 13:12:09 -08:00
Zhongyi Xie	2f1ca4e838	Revert "BaseDeltaIterator: always check valid() before accessing key(… (#4744 ) Summary: …) (#4702)" This reverts commit `3a18bb3e15`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4744 Differential Revision: D13311869 Pulled By: miasantreble fbshipit-source-id: 6300b12cc34828d8b9274e907a3aef1506d5d553	2018-12-03 23:38:27 -08:00
Zhongyi Xie	3a18bb3e15	BaseDeltaIterator: always check valid() before accessing key() (#4702 ) Summary: Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors. This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702 Differential Revision: D13146643 Pulled By: miasantreble fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0	2018-11-30 15:35:13 -08:00
Siying Dong	6e938c904f	Make NewBloomFilterPolicy() use full filter by default (#4735 ) Summary: Full block (use_block_based_builder=false) Bloom filter has clear CPU saving benefits but with limitation of using temp memory when building an SST file proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction. Now we change to a default with better performance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735 Differential Revision: D13266674 Pulled By: siying fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602	2018-11-30 13:13:27 -08:00
Sagar Vemuri	70645355ad	Move FIFOCompactionPicker to a separate file (#4724 ) Summary: Simplified the code layout by moving FIFOCompactionPicker to a separate file. Why?: While trying to add ttl functionality to universal compaction, I found the `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods, which made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724 Differential Revision: D13227914 Pulled By: sagar0 fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3	2018-11-29 16:04:52 -08:00
Yanqin Jin	8d7bc76f36	Fix a flaky test DBFlushTest.SyncFail (#4633 ) Summary: There is a race condition in DBFlushTest.SyncFail, as illustrated below. ``` time thread1 bg_flush_thread \| Flush(wait=false, cfd) \| refs_before=cfd->current()->TEST_refs() PickMemtable calls cfd->current()->Ref() V ``` The race condition between thread1 getting the ref count of cfd's current version and bg_flush_thread incrementing the cfd's current version makes it possible for later assertion on refs_before to fail. Therefore, we add test sync points to enforce the order and assert on the ref count before and after PickMemtable is called in bg_flush_thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633 Differential Revision: D12967131 Pulled By: riversand963 fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b	2018-11-29 13:39:56 -08:00
Kefu Chai	7dbee38716	db/repair: reset Repair::db_lock_ in ctor (#4683 ) Summary: There is a chance that * the caller tries to repair the db when holding the db_lock; in that case the env implementation might not set the `lock` parameter of Repairer::Run(). * the caller somehow never calls Repairer::Run(). Either way, the destructor of Repair will compare the uninitialized db_lock_ with nullptr and try to unlock it. There is a good chance that db_lock_ is not nullptr, then boom. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4683 Differential Revision: D13260287 Pulled By: riversand963 fbshipit-source-id: 878a119d2e9f10a0fa17ee62cf3fb24b33d49fa5	2018-11-29 11:26:41 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Zhichao Cao	7125e24619	Add the max trace file size limitation option to Tracing (#4610 ) Summary: If the user does not end the trace manually, tracing will continue, which can potentially use up all the storage space and cause problems. In this PR, the max trace file size is added to TraceOptions; users can set the value if they need to, otherwise the default is 64GB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610 Differential Revision: D12893400 Pulled By: zhichao-cao fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953	2018-11-27 14:27:05 -08:00
Abhishek Madan	85394a96ca	Speed up range scans with range tombstones (#4677 ) Summary: Previously, every range tombstone iterator was seeked on every ShouldDelete call, which quickly degraded performance for long range scans. This PR improves performance by tracking iterator positions and only advancing iterators when necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677 Differential Revision: D13205373 Pulled By: abhimadan fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb	2018-11-26 16:33:41 -08:00
Zhongyi Xie	a21cb22ee3	Revert "apply ReadOptions.iterate_upper_bound to transaction iterator… (#4705 ) Summary: … (#4656)" This reverts commit `b76398a82b`. Will add test coverage for iterate_upper_bound before re-commit b76398 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4705 Differential Revision: D13148592 Pulled By: miasantreble fbshipit-source-id: 4d1ce0bfd9f7a5359a7688bd780eb06a66f45b1f	2018-11-24 10:46:28 -08:00
Andrew Kryczka	07cf0ee589	Fix ticker stat for number files closed (#4703 ) Summary: We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use... Blame: `63f216ee0a` Closes #4700. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703 Differential Revision: D13146769 Pulled By: ajkr fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008	2018-11-21 18:31:34 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Abhishek Madan	457f77b9ff	Introduce RangeDelAggregatorV2 (#4649 ) Summary: The old RangeDelAggregator did expensive pre-processing work to create a collapsed, binary-searchable representation of range tombstones. With FragmentedRangeTombstoneIterator, much of this work is now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking in each iterator to find a covering tombstone in ShouldDelete, while doing minimal work in AddTombstones. The old RangeDelAggregator is still used during flush/compaction for now, though RangeDelAggregatorV2 will support those uses in a future PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649 Differential Revision: D13146964 Pulled By: abhimadan fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3	2018-11-21 10:56:45 -08:00
Abhishek Madan	ed5aec5ba3	Fix range tombstone covering short-circuit logic (#4698 ) Summary: Since a range tombstone seen at one level will cover all keys in the range at lower levels, there was a short-circuiting check in Get that reported a key was not found at most one file after the range tombstone was discovered. However, this was incorrect for merge operands, since a deletion might only cover some merge operands, which implies that the key should be found. This PR fixes this logic in the Version portion of Get, and removes the logic from the MemTable portion of Get, since the performance benefit provided there is minimal. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698 Differential Revision: D13142484 Pulled By: abhimadan fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10	2018-11-20 13:29:22 -08:00
Siying Dong	13579e8c5a	WriteBufferManager doesn't cost to cache if no limit is set (#4695 ) Summary: WriteBufferManager is not invoked when allocating memory for memtable if the limit is not set, even if a cache is passed. This is inconsistent with what the comment says. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695 Differential Revision: D13112722 Pulled By: siying fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4	2018-11-18 16:55:43 -08:00
Andrew Kryczka	9d6d4867ab	Fix uninitialized fields in file metadata (#4693 ) Summary: This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693 Differential Revision: D13113189 Pulled By: ajkr fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d	2018-11-16 20:49:17 -08:00
Yanqin Jin	147697420a	Rollback memtable flush upon atomic flush fail (#4641 ) Summary: This fixes an assertion. An atomic flush can have multiple flush jobs. Some of them may fail. If any of them fails, we need to rollback all of them. For the flush jobs that do fail, we already call `RollbackMemTableFlush` in `FlushJob::Run`. The tricky part is for flush jobs that have completed successfully. We need to call `RollbackMemTableFlush` for them as well. The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort without the corresponding change in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641 Differential Revision: D12943649 Pulled By: riversand963 fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868	2018-11-14 20:54:17 -08:00
Abhishek Madan	6bee36a786	Modify FragmentedRangeTombstoneList member layout (#4632 ) Summary: Rather than storing a `vector<RangeTombstone>`, we now store a `vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A `RangeTombstoneStack` contains the start and end keys of a range tombstone fragment, and indices into the seqnum vector to indicate which sequence numbers the fragment is located at. The diagram below illustrates an example: ``` tombstones_: [a, b) [c, e) [h, k) \| \ / \ / \| \| \ / \ / \| v v v v tombstone_seqs_: [ 5 3 10 7 2 8 6 ] ``` This format allows binary searching the tombstone list to use less key comparisons, which helps in cases where there are many overlapping tombstones. Also, this format makes it easier to add DBIter-like semantics to `FragmentedRangeTombstoneIterator` in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632 Differential Revision: D13053103 Pulled By: abhimadan fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0	2018-11-14 17:52:17 -08:00
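A rough sketch of the layout described in #4632, not the actual RocksDB classes. The type and member names (`TombstoneStack`, `seq_begin`, `seq_end`, `MaxCoveringSeq`) are hypothetical, and the index ranges in `ExampleFromDiagram` are one reading of the diagram in the commit message (fragments own seqnum slices `[0, 2)`, `[2, 5)`, `[5, 7)`). It shows why a lookup needs one binary search over fragments and one over that fragment's seqnums.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fragment: a key range plus a half-open index range into seqs.
struct TombstoneStack {
  std::string start_key;
  std::string end_key;
  std::size_t seq_begin;
  std::size_t seq_end;
};

struct FragmentedTombstones {
  std::vector<TombstoneStack> stacks;  // non-overlapping, sorted by key
  std::vector<uint64_t> seqs;          // descending within each stack

  // Largest tombstone seqnum <= snapshot that covers key, or 0 if none.
  uint64_t MaxCoveringSeq(const std::string& key, uint64_t snapshot) const {
    // First fragment whose end key is strictly greater than key.
    auto it = std::upper_bound(
        stacks.begin(), stacks.end(), key,
        [](const std::string& k, const TombstoneStack& s) {
          return k < s.end_key;
        });
    if (it == stacks.end() || key < it->start_key) return 0;
    assert(it->seq_begin <= it->seq_end && it->seq_end <= seqs.size());
    auto first = seqs.begin() + it->seq_begin;
    auto last = seqs.begin() + it->seq_end;
    // The slice is descending, so search with greater<> for the first
    // seqnum that is not greater than the snapshot.
    auto pos = std::lower_bound(first, last, snapshot,
                                std::greater<uint64_t>());
    return pos == last ? 0 : *pos;
  }
};

// Example data following the diagram in the commit message (assumed mapping).
FragmentedTombstones ExampleFromDiagram() {
  return FragmentedTombstones{
      {{"a", "b", 0, 2}, {"c", "e", 2, 5}, {"h", "k", 5, 7}},
      {5, 3, 10, 7, 2, 8, 6}};
}
```

Under these assumptions, looking up key `d` at snapshot 8 lands in `[c, e)` and picks seqnum 7 from the slice `10 7 2`, without comparing any keys beyond the fragment search.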
Siying Dong	f5c8cf5fed	Increase wait time in DBTest.SanitizeNumThreads (#4659 ) Summary: DBTest.SanitizeNumThreads sometimes fails. The test waited for a 10ms timeout and expected all scheduled threads to be executing. This can be a source of flakiness. Make a check every 1ms, for up to 10s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4659 Differential Revision: D13074174 Pulled By: siying fbshipit-source-id: b1d5ff87a326a4fc9eab8d1cc307bbb940dfe70c	2018-11-14 16:19:36 -08:00
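The "check every 1ms, up to 10s" pattern from #4659 can be sketched generically as below. `WaitUntil` is a hypothetical helper written for illustration, not the code used in the test.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: poll `done` every `interval` until it returns true
// or `timeout` has elapsed. Returns whether the condition was observed.
bool WaitUntil(const std::function<bool()>& done,
               std::chrono::milliseconds interval,
               std::chrono::milliseconds timeout) {
  assert(interval.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

Compared with a single fixed sleep, a short interval with a generous deadline returns as soon as the condition holds and only pays the full timeout when the test is genuinely failing.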
Zhongyi Xie	d8df169b84	release db mutex when calling ApproximateSize (#4630 ) Summary: `GenSubcompactionBoundaries` calls `VersionSet::ApproximateSize` which gets BlockBasedTableReader for every file and seeks in its index block to find `key`'s offset. If the table or index block aren't in memory already, this involves I/O. This can be improved by releasing DB mutex when calling ApproximateSize. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4630 Differential Revision: D13052653 Pulled By: miasantreble fbshipit-source-id: cae31d46d10d0860fa8a26b8d5154b2d17d1685f	2018-11-13 17:08:34 -08:00
Zhongyi Xie	b76398a82b	apply ReadOptions.iterate_upper_bound to transaction iterator (#4656 ) Summary: Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656 Differential Revision: D13039257 Pulled By: miasantreble fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1	2018-11-13 15:44:15 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy. It is normal for external cache libraries to couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper of the library implementing both the `Cache` and `MemoryAllocator` interfaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
Siying Dong	abb1a8fc23	Add a unit test to assert number of preads (#4657 ) Summary: We used to have a bug which caused every block to be read twice, and none of our tests caught it. Add a very simple unit test to make sure that when reading a data block, we only issue one pread against the SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657 Differential Revision: D13005260 Pulled By: siying fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c	2018-11-13 12:52:19 -08:00
QingpingWang	4f0fcb78ae	Expose num entries and deletions of sst files (#4623 ) Summary: The ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level. Also refer to #3980 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623 Differential Revision: D13045744 Pulled By: sagar0 fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3	2018-11-13 11:52:19 -08:00
Soli Como	5945e16dfc	Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498 ) Summary: Currently, `Statistics` can record ticks by `recordTick()`, whose second parameter is a `uint64_t`. That means a tick can only increase. If we want to reduce a tick, we have to work around it like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`. That's kind of a hack. So, this PR divides `NO_ITERATORS` into two counters, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only. Fixes #3013 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498 Differential Revision: D10395010 Pulled By: sagar0 fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2	2018-11-13 11:46:32 -08:00
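To illustrate the point in #4498, the sketch below contrasts the wraparound workaround with a pair of increase-only counters. `DecrementByWraparound` and `IteratorTickers` are hypothetical names for illustration and do not use the `Statistics` API.

```cpp
#include <cassert>
#include <cstdint>

// Old workaround: adding uint64_t(-1) wraps modulo 2^64, which acts as a
// decrement even though the parameter type is unsigned.
uint64_t DecrementByWraparound(uint64_t ticker) {
  return ticker + uint64_t(-1);
}

// Hypothetical pair of increase-only counters. The number of live iterators
// is derived by the reader instead of being stored in one ticker.
struct IteratorTickers {
  uint64_t created = 0;
  uint64_t deleted = 0;

  void OnCreate() { ++created; }
  void OnDelete() {
    assert(deleted < created);
    ++deleted;
  }
  uint64_t Live() const { return created - deleted; }
};
```

With two monotonic counters, consumers that aggregate or diff tickers over time never see a value go down, and the live count is still recoverable as `created - deleted`.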
Soli	a478682260	Fix #3840 : only `SyncClosedLogs` for multiple CFs (#4460 ) Summary: Call `SyncClosedLogs()` only if there is more than one column family. Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly. See #3840 for more info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460 Differential Revision: D12896377 Pulled By: riversand963 fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf	2018-11-13 11:32:16 -08:00
Andrew Kryczka	ea9454700a	Backup engine support for direct I/O reads (#4640 ) Summary: Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640 Differential Revision: D13015268 Pulled By: ajkr fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269	2018-11-13 11:17:25 -08:00
Zhongyi Xie	b313019326	use per-level perfcontext for DB::Get calls (#4617 ) Summary: This PR adds two more per-level perf context counters to track * number of keys returned in Get calls, broken down by level * total processing time at each level during Get calls Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617 Differential Revision: D12898024 Pulled By: miasantreble fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4	2018-11-13 10:40:49 -08:00


3518 Commits