rocksdb

Author	SHA1	Message	Date
Siying Dong	49ddd7ec4f	Stats should be logged in INFO level (#4977 ) Summary: Previously, stats were logged in warning level. This was done in that way because people reported that it wasn't logged in MyRocks. However, later we learned that it turns out to be due to a bug in MyRocks, which is fixed in `79bb705e74` Now we revert the stats logging to INFO level, so that it doesn't pollute the warning level logging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977 Differential Revision: D14058485 Pulled By: siying fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6	2019-02-12 16:54:55 -08:00
Yanqin Jin	c5a64cffd2	Avoid fsync on the same directory in atomic flush (#4817 ) Summary: In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once on the same data directory. If two column families share a common directory for their data, we call fsync only once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817 Differential Revision: D13543689 Pulled By: riversand963 fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5	2019-02-12 12:28:36 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
Maysam Yabandeh	576d2d6c60	WritePrepared: relax assert in compaction iterator (#4969 ) Summary: If IsInSnapshot(seq2, snapshot) determines that the snapshot is released, the future queries IsInSnapshot(seq1, snapshot) could still return a definitive answer of true if for example seq1 is too old that is determined visible in all snapshots. This violates a recently added assert statement to compaction iterator. The patch relaxes the assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4969 Differential Revision: D14030998 Pulled By: maysamyabandeh fbshipit-source-id: 6db53db0e37d0a20e8997ef2c1004b8627614ab9	2019-02-11 15:01:46 -08:00
Yanqin Jin	2d049ab7e8	Checksum properties block for block-based table (#4956 ) Summary: Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956 Differential Revision: D14012741 Pulled By: riversand963 fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186	2019-02-11 11:50:01 -08:00
Siying Dong	5d9a623e2c	Add a unit test to Ignorable manfiest record (#4964 ) Summary: https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manfiest record. Adding a test to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964 Differential Revision: D14012667 Pulled By: siying fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef	2019-02-11 11:20:24 -08:00
tang-jianfeng	08809f5e6c	Implement trace sampling (#4963 ) Summary: Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963 Differential Revision: D14011190 Pulled By: tang-jianfeng fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a	2019-02-08 18:08:18 -08:00
Siying Dong	1a761e6a6c	Add a placeholder in manifest indicating ignorable record (#4960 ) Summary: We want to reserve some right that some extra information added manifest in the future can be forward compatible by previous versions. Now we create a place holder for that. A bit in tag is added to indicate that a field can be safely ignored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960 Differential Revision: D14000484 Pulled By: siying fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6	2019-02-08 11:33:11 -08:00
Siying Dong	f48758e939	Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954 ) Summary: We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't what we have expected. We thought that snapshot will always be preserved. However, we just realized that, if no snapshot is created while compaction starts, and a snapshot is created after that, the data seen from the snapshot can successfully be dropped by the compaction. This creates a strange behavior to the feature, which is hard to explain. Like what is documented in code comment, this feature is not very useful with snapshot anyway. The decision is to deprecate the feature. We keep the function to avoid to break users code. However, we will fail compactions if false is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954 Differential Revision: D13981900 Pulled By: siying fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36	2019-02-07 16:57:33 -08:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Zhongyi Xie	71cae59a99	exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950 ) Summary: This will fix the following build error: > db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’: > db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950 Differential Revision: D13965378 Pulled By: miasantreble fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56	2019-02-05 17:01:11 -08:00
Zhongyi Xie	00ed41daee	Allow copy for PerfContext objects (#4919 ) Summary: Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919 Differential Revision: D13960406 Pulled By: miasantreble fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334	2019-02-05 14:29:08 -08:00
Jay Zhuang	c9a52cbdc8	Fix potential DB hang while using CompactFiles (#4940 ) Summary: CompactFiles() may block auto compaction which could cuase DB hang when it reachs level0_stop_writes_trigger. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940 Differential Revision: D13929648 Pulled By: cooldoger fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381	2019-02-05 11:23:38 -08:00
Siying Dong	8fe073324f	BYTES_READ stats miscount for NotFound cases (#4938 ) Summary: In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes is still be increased. The amount increased will be whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not increasing these two counters. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938 Differential Revision: D13908963 Pulled By: siying fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162	2019-02-05 10:53:35 -08:00
yangzhijia	31221bb7e8	Properly set upper bound of subcompaction output (#4879 ) (#4898 ) Summary: Fix the ouput overlap bug when using subcompactions, the upper bound of output file was extended incorrectly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898 Differential Revision: D13736107 Pulled By: ajkr fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3	2019-02-05 10:20:16 -08:00
Maysam Yabandeh	30468d8eb4	Fix analyze error on possible un-initialized value (#4937 ) Summary: The patch fixes the following analyze error by checking the return status of ParseInternalKey. ``` db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value assert(kTypeMerge == orig_ikey.type); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937 Differential Revision: D13908506 Pulled By: maysamyabandeh fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6	2019-02-01 09:41:27 -08:00
Ming Zhao	59244447e3	Zero seqnum of final key / drop final tombstone when compacting to bottommost level Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927 Differential Revision: D13889458 Pulled By: mzhaom fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e	2019-02-01 09:21:57 -08:00
Yanqin Jin	842cdc11dd	Use correct FileMeta for atomic flush result install (#4932 ) Summary: 1. this commit fixes our handling of a combination of two separate edge cases. If a flush job does not pick any memtable to flush (because another flush job has already picked the same memtables), and the column family assigned to the flush job is dropped right before RocksDB calls rocksdb::InstallMemtableAtomicFlushResults, our original code passes a FileMetaData object whose file number is 0, failing the assertion in rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)). 2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932 Differential Revision: D13901322 Pulled By: riversand963 fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9	2019-01-31 14:49:51 -08:00
Maysam Yabandeh	35e5689e11	Take snapshots once for all cf flushes (#4934 ) Summary: FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934 Differential Revision: D13900747 Pulled By: maysamyabandeh fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c	2019-01-31 12:21:59 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Yanqin Jin	158da7a6ee	Verify checksum before ingestion (#4916 ) Summary: before file ingestion (in preparation phase), verify the checksums of the blocks of the external SST file, including properties block with global seqno. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916 Differential Revision: D13863501 Pulled By: riversand963 fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee	2019-01-29 17:17:29 -08:00
Sagar Vemuri	4978caaa6f	Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925 ) Summary: While stepping through the code I noticed that there is a redundant call to TableFileName. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925 Differential Revision: D13845749 Pulled By: sagar0 fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d	2019-01-28 13:33:23 -08:00
Siying Dong	ee1818081f	Remove PlainTable's feature store_index_in_file (#4914 ) Summary: Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914 Differential Revision: D13791883 Pulled By: siying fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e	2019-01-28 12:50:22 -08:00
Siying Dong	bc7d1661a8	Fix test name typo in PlainTableDBTest Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926 Differential Revision: D13830196 Pulled By: siying fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021	2019-01-25 18:14:26 -08:00
Siying Dong	f184bee77b	PlainTable should avoid copying Get() results from immortal source. (#4924 ) Summary: https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable (read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924 Differential Revision: D13827749 Pulled By: siying fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96	2019-01-25 17:12:19 -08:00
Siying Dong	fc53839bfa	Disallow customized hash function in DynamicBloom (#4915 ) Summary: I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915 Differential Revision: D13794452 Pulled By: siying fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819	2019-01-24 10:34:30 -08:00
Dmitry Fink	e07aa8669d	Allow full merge when root of history for a key is reached (#4909 ) Summary: Previously compaction was not collapsing operands for a first key on a layer, even in cases when it was its root of history. Some tests (CompactionJobTest.NonAssocMerge) was actually accounting for that bug, Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909 Differential Revision: D13781169 Pulled By: finik fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336	2019-01-23 21:46:10 -08:00
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	2019-01-23 18:15:47 -08:00
PeifengSi	43defe9872	Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902 ) Summary: Even one key falls in a file's range, we can not infer it definitely exists in this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902 Differential Revision: D13795018 Pulled By: siying fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914	2019-01-23 18:00:56 -08:00
Siying Dong	d94aa2f7db	Make compaction_pri = kMinOverlappingRatio to be default (#4911 ) Summary: compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default. https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911 Differential Revision: D13789262 Pulled By: siying fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269	2019-01-23 16:47:38 -08:00
Siying Dong	5bf941966b	CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907 ) Summary: Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio to boost files with lots of tombstones too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907 Differential Revision: D13788774 Pulled By: siying fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3	2019-01-23 13:21:01 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Yi Wu	b1ad6ebba8	WritePrepared: fix two versions in compaction see different status for released snapshots (#4890 ) Summary: Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios: Scenario 1: key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1. The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key. Scenario 2: key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here. The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890 Differential Revision: D13705538 Pulled By: maysamyabandeh fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4	2019-01-18 17:24:06 -08:00
Yi Wu	128f532858	WritePrepared: fix issue with snapshot released during compaction (#4858 ) Summary: Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858 Differential Revision: D13617146 Pulled By: maysamyabandeh fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3	2019-01-16 09:55:32 -08:00
Yanqin Jin	e79df377c5	Use chrono::time_point instead of time_t (#4868 ) Summary: By convention, time_t almost always stores the integral number of seconds since 00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/. We surely want more precision than seconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868 Differential Revision: D13633046 Pulled By: riversand963 fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396	2019-01-16 09:51:05 -08:00
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	2019-01-15 21:34:38 -08:00
Maysam Yabandeh	cad99a6031	WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886 ) Summary: The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes: i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare. ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max. To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886 Differential Revision: D13685270 Pulled By: maysamyabandeh fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0	2019-01-15 18:11:52 -08:00
Siying Dong	7d13f307ff	Improve Error Message When wal_dir doesn't exist (#4874 ) Summary: Right now the error mesage when options.wal_dir doesn't exist is not helpful to users. Be more specific Pull Request resolved: https://github.com/facebook/rocksdb/pull/4874 Differential Revision: D13642425 Pulled By: siying fbshipit-source-id: 9a3172ed0f799af233b0f3b2e5e35bc7ce04c7b5	2019-01-15 16:46:04 -08:00
Yanqin Jin	301da345ae	Make a copy of MutableCFOptions to avoid race condition (#4876 ) Summary: If we do not do this, then reading MutableCFOptions may have a race condition with SetOptions which modifies MutableCFOptions. Also reserve space in advance for vectors to avoid reallocation changing the address of its elements. Test plan ``` $make clean && make -j32 all check $make clean && COMPILE_WITH_TSAN=1 make -j32 all check $make clean && COMPILE_WITH_ASAN=1 make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876 Differential Revision: D13644500 Pulled By: riversand963 fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47	2019-01-11 17:43:37 -08:00
Maysam Yabandeh	d56ac22b44	Remove duplicates from SnapshotList::GetAll (#4860 ) Summary: The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860 Differential Revision: D13615502 Pulled By: maysamyabandeh fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b	2019-01-09 16:25:42 -08:00
Siying Dong	8641e9adf7	Non-initial file preloading should always prefetch index and filter (#4852 ) Summary: https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1. It doesn't preload index and filter in non-initial file loading case. This is a little bit too complicated to understand. We observed in one MyRocks use case where the filter is expected to be preloaded but is not. To simplify the use case, we simply always prefetch the index and filter. They anyway is expected to be loaded in the file verification phase anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852 Differential Revision: D13595402 Pulled By: siying fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31	2019-01-08 12:47:34 -08:00
tom wang	42135523a0	modify comments about flush_queue_ Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4850 Differential Revision: D13591940 Pulled By: sagar0 fbshipit-source-id: 617794e0a41d0f4554d40871180b061e84189fc5	2019-01-07 13:52:59 -08:00
Yi Wu	cf852fdf55	Minor fix: single delete a blob value is not a mismatch (#4848 ) Summary: In compaction iterator, if the next value of single delete is a blob value, it should not treated as mismatch. This is only a minor fix and doesn't affect correctness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848 Differential Revision: D13585812 Pulled By: yiwu-arbug fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0	2019-01-04 16:31:02 -08:00
Andrew Kryczka	9e2c804fe6	Fix point lookup on range tombstone sentinel endpoint (#4829 ) Summary: Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this: - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1]. - L1's file2 gets compacted to L2. - User issues `Get()` for b#3. - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1. - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`. The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829 Differential Revision: D13561355 Pulled By: ajkr fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f	2019-01-04 11:24:08 -08:00
Yanqin Jin	a07175af65	Refactor atomic flush result installation to MANIFEST (#4791 ) Summary: as titled. Since different bg flush threads can flush different sets of column families (due to column family creation and drop), we decide not to let one thread perform atomic flush result installation for other threads. Bg flush threads will install their atomic flush results sequentially to MANIFEST, using a conditional variable, i.e. atomic_flush_install_cv_ to coordinate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791 Differential Revision: D13498930 Pulled By: riversand963 fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7	2019-01-03 20:56:24 -08:00
Yi Wu	77a8d4d476	Detect if Jemalloc is linked with the binary (#4844 ) Summary: Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844 Differential Revision: D13574934 Pulled By: yiwu-arbug fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db	2019-01-03 16:30:12 -08:00
DorianZheng	8c79f79208	Fix skip WAL for whole write_group when leader's callback fail (#4838 ) Summary: The original implementation has two problems: 1. `f0dda35d7d/db/db_impl_write.cc (L478)` `f0dda35d7d/db/write_thread.h (L231)` If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss. 2. `f0dda35d7d/db/write_thread.h (L130)` The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838 Differential Revision: D13574070 Pulled By: yiwu-arbug fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b	2019-01-03 12:40:42 -08:00
Siying Dong	e4feb78606	Try to fix DBSSTTest.RateLimitedDelete flakiness (#4840 ) Summary: DBSSTTest.RateLimitedDelete is flakey. The root cause is not completely identified, but the compaction waiting in the test doesn't strictly wait for compaction cleaning to finish, which may cause test flakiness. Fix it first and see whether the failures still happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840 Differential Revision: D13567273 Pulled By: siying fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a	2019-01-03 11:05:19 -08:00
Andrew Kryczka	ace543a815	fix accounting for range tombstones in TableProperties (#4841 ) Summary: - To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`. - Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841 Differential Revision: D13568424 Pulled By: ajkr fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026	2019-01-02 15:08:53 -08:00
Anand Ananthabhotla	b9d6eccac1	Lock free MultiGet (#4754 ) Summary: Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex. Tests: 1. Unit tests 2. make check 3. db_bench - TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1 readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found) Multireadrandom with batch size 1: multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754 Differential Revision: D13363550 Pulled By: anand1976 fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4	2019-01-02 11:42:54 -08:00

1 2 3 4 5 ...

3475 Commits