rocksdb

Author	SHA1	Message	Date
Andrew Kryczka	07cf0ee589	Fix ticker stat for number files closed (#4703 ) Summary: We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use... Blame: `63f216ee0a` Closes #4700. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703 Differential Revision: D13146769 Pulled By: ajkr fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008	2018-11-21 18:31:34 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Abhishek Madan	457f77b9ff	Introduce RangeDelAggregatorV2 (#4649 ) Summary: The old RangeDelAggregator did expensive pre-processing work to create a collapsed, binary-searchable representation of range tombstones. With FragmentedRangeTombstoneIterator, much of this work is now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking in each iterator to find a covering tombstone in ShouldDelete, while doing minimal work in AddTombstones. The old RangeDelAggregator is still used during flush/compaction for now, though RangeDelAggregatorV2 will support those uses in a future PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649 Differential Revision: D13146964 Pulled By: abhimadan fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3	2018-11-21 10:56:45 -08:00
Abhishek Madan	ed5aec5ba3	Fix range tombstone covering short-circuit logic (#4698 ) Summary: Since a range tombstone seen at one level will cover all keys in the range at lower levels, there was a short-circuiting check in Get that reported a key was not found at most one file after the range tombstone was discovered. However, this was incorrect for merge operands, since a deletion might only cover some merge operands, which implies that the key should be found. This PR fixes this logic in the Version portion of Get, and removes the logic from the MemTable portion of Get, since the performance benefit provided there is minimal. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698 Differential Revision: D13142484 Pulled By: abhimadan fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10	2018-11-20 13:29:22 -08:00
Siying Dong	13579e8c5a	WriteBufferManager doesn't cost to cache if no limit is set (#4695 ) Summary: WriteBufferManager is not invoked when allocating memory for memtable if the limit is not set, even if a cache is passed. This is inconsistent with what the comment says. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695 Differential Revision: D13112722 Pulled By: siying fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4	2018-11-18 16:55:43 -08:00
Andrew Kryczka	9d6d4867ab	Fix uninitialized fields in file metadata (#4693 ) Summary: This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693 Differential Revision: D13113189 Pulled By: ajkr fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d	2018-11-16 20:49:17 -08:00
Yanqin Jin	147697420a	Rollback memtable flush upon atomic flush fail (#4641 ) Summary: This fixes an assertion. An atomic flush can have multiple flush jobs. Some of them may fail. If any of them fails, we need to rollback all of them. For the flush jobs that do fail, we already call `RollbackMemTableFlush` in `FlushJob::Run`. The tricky part is for flush jobs that have completed successfully. We need to call `RollbackMemTableFlush` for them as well. The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort without the corresponding change in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641 Differential Revision: D12943649 Pulled By: riversand963 fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868	2018-11-14 20:54:17 -08:00
Abhishek Madan	6bee36a786	Modify FragmentedRangeTombstoneList member layout (#4632 ) Summary: Rather than storing a `vector<RangeTombstone>`, we now store a `vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A `RangeTombstoneStack` contains the start and end keys of a range tombstone fragment, and indices into the seqnum vector to indicate which sequence numbers the fragment is located at. The diagram below illustrates an example: ``` tombstones_: [a, b) [c, e) [h, k) \| \ / \ / \| \| \ / \ / \| v v v v tombstone_seqs_: [ 5 3 10 7 2 8 6 ] ``` This format allows binary searching the tombstone list to use less key comparisons, which helps in cases where there are many overlapping tombstones. Also, this format makes it easier to add DBIter-like semantics to `FragmentedRangeTombstoneIterator` in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632 Differential Revision: D13053103 Pulled By: abhimadan fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0	2018-11-14 17:52:17 -08:00
Siying Dong	f5c8cf5fed	Increase wait time in DBTest.SanitizeNumThreads (#4659 ) Summary: DBTest.SanitizeNumThreads sometimes fails. The test waited for a 10ms timeout and expected all scheduled threads to be executed. This can be a source of flakiness. Make a check every 1ms, for up to 10s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4659 Differential Revision: D13074174 Pulled By: siying fbshipit-source-id: b1d5ff87a326a4fc9eab8d1cc307bbb940dfe70c	2018-11-14 16:19:36 -08:00
Zhongyi Xie	d8df169b84	release db mutex when calling ApproximateSize (#4630 ) Summary: `GenSubcompactionBoundaries` calls `VersionSet::ApproximateSize` which gets BlockBasedTableReader for every file and seeks in its index block to find `key`'s offset. If the table or index block aren't in memory already, this involves I/O. This can be improved by releasing DB mutex when calling ApproximateSize. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4630 Differential Revision: D13052653 Pulled By: miasantreble fbshipit-source-id: cae31d46d10d0860fa8a26b8d5154b2d17d1685f	2018-11-13 17:08:34 -08:00
Zhongyi Xie	b76398a82b	apply ReadOptions.iterate_upper_bound to transaction iterator (#4656 ) Summary: Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656 Differential Revision: D13039257 Pulled By: miasantreble fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1	2018-11-13 15:44:15 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy. It is normal for external cache libraries to couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper of the library implementing both the `Cache` and `MemoryAllocator` interfaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
Siying Dong	abb1a8fc23	Add a unit test to assert number of preads (#4657 ) Summary: We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simple unit test to make sure that when reading a data block, we only issue one pread against the SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657 Differential Revision: D13005260 Pulled By: siying fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c	2018-11-13 12:52:19 -08:00
QingpingWang	4f0fcb78ae	Expose num entries and deletions of sst files (#4623 ) Summary: The ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level. Also refer to #3980. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623 Differential Revision: D13045744 Pulled By: sagar0 fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3	2018-11-13 11:52:19 -08:00
Soli Como	5945e16dfc	Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498 ) Summary: Currently, `Statistics` can record a tick via `recordTick()`, whose second parameter is a `uint64_t`. That means a tick can only increase. If we want to reduce a tick, we have to use a workaround like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`. That's kind of a hack. So, this PR divides `NO_ITERATORS` into two counters, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only. Fixes #3013 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498 Differential Revision: D10395010 Pulled By: sagar0 fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2	2018-11-13 11:46:32 -08:00
Soli	a478682260	Fix #3840 : only `SyncClosedLogs` for multiple CFs (#4460 ) Summary: Call `SyncClosedLogs()` only if there are more than one column families. Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly. See #3840 for more info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460 Differential Revision: D12896377 Pulled By: riversand963 fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf	2018-11-13 11:32:16 -08:00
Andrew Kryczka	ea9454700a	Backup engine support for direct I/O reads (#4640 ) Summary: Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640 Differential Revision: D13015268 Pulled By: ajkr fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269	2018-11-13 11:17:25 -08:00
Zhongyi Xie	b313019326	use per-level perfcontext for DB::Get calls (#4617 ) Summary: this PR adds two more per-level perf context counters to track * number of keys returned in Get call, break down by levels * total processing time at each level during Get call Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617 Differential Revision: D12898024 Pulled By: miasantreble fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4	2018-11-13 10:40:49 -08:00
Sagar Vemuri	2993cd2002	Fix RocksDB Lite build (#4675 ) Summary: Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SSTFileWriter` and `IngestExternalFile`, but these are not exposed under lite mode. Fixed by #ifdef'ing out the test. ``` db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’: db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’ rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options}; ^ In file included from ./util/testharness.h:15:0, from ./table/mock_table.h:23, from ./db/db_test_util.h:44, from db/db_test2.cc:13: db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope ASSERT_OK(sst_file_writer.Open(external_file1)); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675 Differential Revision: D13035984 Pulled By: sagar0 fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6	2018-11-12 19:01:37 -08:00
Abhishek Madan	7d04ef4655	Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668 ) Summary: Part of the test required that a compaction start before a manual flush, but this was not enforced by the test. In some cases, particularly when writing to tmpfs, this could lead to the compaction starting after the flush, which caused the base level to be higher than it was expected to be. Add a sync point in the test to ensure that the flush and compaction happen simultaneously. The test also had some stale comments, so those have been removed or modified, and the test has been simplified so that it no longer uses sleeps and writes uncompressed SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4668 Differential Revision: D13032440 Pulled By: abhimadan fbshipit-source-id: 3f23b583a096454dafb8d8ea75678605dec80209	2018-11-12 16:42:16 -08:00
DorianZheng	0f88160f67	Fix `CompactFiles` bug (#4665 ) Summary: `CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names`. The timeline of the execution flow is as follows. Let's say that level N has two files, [1,2] and [5,6]: ``` timeline user_thread1 user_thread2 t0 \| CompactFiles([1, 2], [5, 6]) begin t1 \| GetReferencedSuperVersion() t2 \| IngestExternalFile([3,4]) to level N begin t3 \| CompactFiles resume V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665 Differential Revision: D13030674 Pulled By: ajkr fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51	2018-11-12 14:32:18 -08:00
Yanqin Jin	05dec0c7c7	Remove redundant member var and set options (#4631 ) Summary: In the past, both `DBImpl::atomic_flush_` and `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is set correctly. This does not lead to incorrect behavior, but is a duplicate of information. Since `immutable_db_options_` is always there and has `atomic_flush`, we should use it as source of truth and remove `DBImpl::atomic_flush_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631 Differential Revision: D12928371 Pulled By: riversand963 fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d	2018-11-12 12:24:26 -08:00
DorianZheng	09426ae1c7	Fix `DBImpl::GetColumnFamilyHandleUnlocked` data race (#4666 ) Summary: Hi, yiwu-arbug, I found that `DBImpl::GetColumnFamilyHandleUnlocked` still has a data race, because `column_family_memtables_` has a stateful cache `current_` and `column_family_memtables_::Seek` may be called without the protection of `mutex_` by a write thread; check `859dbda6e3/db/write_batch.cc (L1188)` and `859dbda6e3/db/write_batch.cc (L1756)` and `859dbda6e3/db/db_impl_write.cc (L318)`. So it's better to use `versions_->GetColumnFamilySet()->GetColumnFamily` instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4666 Differential Revision: D13027117 Pulled By: yiwu-arbug fbshipit-source-id: 4e3778eaf8e7f7c8577bbd78129b6a5fd7ce79fb	2018-11-12 11:52:34 -08:00
Yi Wu	859dbda6e3	Fix DBTest.SoftLimit flakiness (#4658 ) Summary: The flakiness can be reproduced with the following patch: ``` --- a/db/db_impl_compaction_flush.cc +++ b/db/db_impl_compaction_flush.cc @@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() { if (job_context.HaveSomethingToDelete()) { PurgeObsoleteFiles(job_context); } + static int f_count = 0; + printf("clean flush job context %d\n", ++f_count); + env_->SleepForMicroseconds(1000000); job_context.Clean(); mutex_.Lock(); } ``` The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` being called. The event listener is triggered on `JobContext::Clean`, which happens after flush result is installed. At the time we check for stall condition after flushing memtable, the job context cleanup may not be finished. To fix the flakiness, we use sync point to create a custom WaitForFlush that waits for context cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658 Differential Revision: D13007301 Pulled By: yiwu-arbug fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712	2018-11-09 16:45:19 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Zhongyi Xie	fce5994603	Add more sync point to fix flaky test GroupCommitTest Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4637 Differential Revision: D12963727 Pulled By: miasantreble fbshipit-source-id: 76053501afbecc6ef388ddc56542fa0185243e3f	2018-11-07 14:07:53 -08:00
Siying Dong	566fc8b994	Black list some valgrind tests (#4642 ) Summary: valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642 Differential Revision: D12945237 Pulled By: siying fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d	2018-11-06 14:22:36 -08:00
Andrew Kryczka	fffac43cfb	Add DB property for SST files kept from deletion (#4618 ) Summary: This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618 Differential Revision: D12898179 Pulled By: ajkr fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776	2018-11-05 20:24:40 -08:00
Siying Dong	c3105aa50d	Try to fix ExternalSSTFileTest.IngestNonExistingFile flakiness (#4625 ) Summary: ExternalSSTFileTest.IngestNonExistingFile occasionally fails because the number of SST files after manual compaction doesn't go down as expected. Although I haven't found a reason for how this can happen, adding an extra wait to make sure obsolete file purging has finished before we check the files doesn't hurt. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4625 Differential Revision: D12910586 Pulled By: siying fbshipit-source-id: 2a5ddec6908c99cf3bcc78431c6f93151c2cab59	2018-11-02 17:26:35 -07:00
Zhongyi Xie	61311157ff	exclude get db property calls from rocksdb_lite (#4619 ) Summary: fix current failing lite test: > In file included from ./util/testharness.h:15:0, from ./table/mock_table.h:23, from ./db/db_test_util.h:44, from db/db_flush_test.cc:10: db/db_flush_test.cc: In member function ‘virtual void rocksdb::DBFlushTest_ManualFlushFailsInReadOnlyMode_Test::TestBody()’: db/db_flush_test.cc:250:35: error: ‘Properties’ is not a member of ‘rocksdb::DB’ ASSERT_TRUE(db_->GetIntProperty(DB::Properties::kBackgroundErrors, ^ make: *** [db/db_flush_test.o] Error 1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4619 Differential Revision: D12898319 Pulled By: miasantreble fbshipit-source-id: 72de603b1f2e972fc8caa88611798c4e98e348c6	2018-11-02 11:28:59 -07:00
Yanqin Jin	de18a2d82e	Update test to cover a new case in file ingestion (#4614 ) Summary: The new case is directIO = true, write_global_seqno = false in which we no longer write global_seqno to the external SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4614 Differential Revision: D12885001 Pulled By: riversand963 fbshipit-source-id: 7541bdc608b3a0c93d3c3c435da1b162b36673d4	2018-11-01 16:23:49 -07:00
Bo Hou	cd9404bb77	xxhash 64 support Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607 Reviewed By: siying Differential Revision: D12836696 Pulled By: jsjhoubo fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870	2018-11-01 15:44:06 -07:00
Andrew Kryczka	5c794d94c4	Prevent manual flush hanging in read-only mode (#4615 ) Summary: The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615 Differential Revision: D12888008 Pulled By: ajkr fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c	2018-11-01 15:27:06 -07:00
Andrew Kryczka	b8f68bac38	Prevent manual compaction hanging in read-only mode (#4611 ) Summary: A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611 Differential Revision: D12871217 Pulled By: ajkr fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e	2018-10-31 17:24:36 -07:00
Yanqin Jin	d1118f6f19	Add test to check if DB can handle atomic group (#4433 ) Summary: Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has a valid atomic group, an incomplete trailing atomic group, an atomic group mixed with normal version edits, and an atomic group with incorrect size. With this capability, RocksDB identifies invalid groups of version edits and does not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433 Differential Revision: D10079202 Pulled By: riversand963 fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876	2018-10-30 16:37:47 -07:00
Abhishek Madan	eaaf1a6f05	Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594 ) Summary: Since the number of range deletions are reported in TableProperties, it is confusing to not report the number of merge operands and point deletions as top-level properties; they are accessible through the public API, but since they are not the "main" properties, they do not appear in aggregated table properties, or the string representation of table properties. This change promotes those two property keys to `rocksdb/table_properties.h`, adds corresponding uint64 members for them, deprecates the old access methods `GetDeletedKeys()` and `GetMergeOperands()` (though they are still usable for now), and removes `InternalKeyPropertiesCollector`. The property key strings are the same as before this change, so this should be able to read DBs written from older versions (though I haven't tested this yet). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594 Differential Revision: D12826893 Pulled By: abhimadan fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68	2018-10-30 15:34:27 -07:00
Siying Dong	9da88a8321	Remove info logging in db mutex inside EnableFileDeletions() (#4604 ) Summary: EnableFileDeletions() does info logging inside db mutex. This is not recommended in the code base, since there could be I/O involved. Move this outside the DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4604 Differential Revision: D12834432 Pulled By: siying fbshipit-source-id: ffe5c2626fcfdb4c54a661a3c3b0bc95054816cf	2018-10-30 10:33:59 -07:00
Andrew Kryczka	cae540ebef	Fix range tombstones written to more files than necessary (#4592 ) Summary: When there's a gap between files, we do not need to output tombstones starting at the next output file's begin key to the current output file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4592 Differential Revision: D12808627 Pulled By: ajkr fbshipit-source-id: 77c8b2e7523a95b1cd6611194144092c06acb505	2018-10-29 19:23:27 -07:00
Yanqin Jin	806ff34b61	Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596 ) Summary: Since ErrorHandler::RecoverFromNoSpace is no-op in LITE mode, then we should not have this test in LITE mode. If we do keep it, it will cause the test thread to wait on bg_cv_ that will not be signalled. How to reproduce ``` $make clean && git checkout `a27fce408e` $OPT="-DROCKSDB_LITE -g" make -j20 $./db_io_failure_test --gtest_filter=DBIOFailureTest.NoSpaceCompactRange ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4596 Differential Revision: D12818516 Pulled By: riversand963 fbshipit-source-id: bc83524f40fff1e29506979017f7f4c2b70322f3	2018-10-29 14:36:31 -07:00
Yanqin Jin	92b4401566	Avoid memtable cut when active memtable is empty (#4595 ) Summary: For flush triggered by RocksDB due to memory usage approaching certain threshold (WriteBufferManager or Memtable full), we should cut the memtable only when the current active memtable is not empty, i.e. contains data. This is what we do for non-atomic flush. If we always cut memtable even when the active memtable is empty, we will generate extra, empty immutable memtable. This is not ideal since it may cause write stall. It also causes some DBAtomicFlushTest to fail because cfd->imm()->NumNotFlushed() is different from expectation. Test plan ``` $make clean && make J=1 -j32 all check $make clean && OPT="-DROCKSDB_LITE -g" make J=1 -j32 all check $make clean && TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make J=1 -j32 valgrind_test ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4595 Differential Revision: D12818520 Pulled By: riversand963 fbshipit-source-id: d867bdbeacf4199fdd642debb085f94703c41a18	2018-10-29 09:45:32 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Yi Wu	f560c8f5c8	s/CacheAllocator/MemoryAllocator/g (#4590 ) Summary: Rename the interface, as it is meant to be a generic interface for memory allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590 Differential Revision: D10866340 Pulled By: yiwu-arbug fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16	2018-10-26 14:30:30 -07:00
Abhishek Madan	7528130e38	Cache fragmented range tombstones in BlockBasedTableReader (#4493 ) Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following: ``` readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found) ``` Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results): ``` Tombstones? \| avg micros/op \| stddev micros/op \| avg ops/s \| stddev ops/s ----------------- \| ------------- \| ---------------- \| ------------ \| ------------ None \| 0.6186 \| 0.04637 \| 1,625,252.90 \| 124,679.41 500 Expanded \| 0.6019 \| 0.03628 \| 1,666,670.40 \| 101,142.65 500 Unexpanded \| 0.6435 \| 0.03994 \| 1,559,979.40 \| 104,090.52 1k Expanded \| 0.6034 \| 0.04349 \| 1,665,128.10 \| 125,144.57 1k Unexpanded \| 0.6261 \| 0.03093 \| 1,600,457.50 \| 79,024.94 5k Expanded \| 0.6163 \| 0.05926 \| 1,636,668.80 \| 154,888.85 5k Unexpanded \| 0.6402 \| 0.04002 \| 1,567,804.70 \| 100,965.55 10k Expanded \| 0.6036 \| 0.05105 \| 1,667,237.70 \| 142,830.36 10k Unexpanded \| 0.6128 \| 0.02598 \| 1,634,633.40 \| 72,161.82 25k Expanded \| 0.6198 \| 0.04542 \| 1,620,980.50 \| 116,662.93 25k Unexpanded \| 0.5478 \| 0.0362 \| 1,833,059.10 \| 121,233.81 50k Expanded \| 0.5104 \| 0.04347 \| 1,973,107.90 \| 184,073.49 50k Unexpanded \| 0.4528 \| 0.03387 \| 2,219,034.50 \| 170,984.32 ``` After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493 Differential Revision: D10842844 Pulled By: abhimadan fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509	2018-10-25 19:26:44 -07:00
Zhongyi Xie	fe0d23059d	Fix two contrun job failures (#4587 ) Summary: Currently there are two contrun test failures: * rocksdb-contrun-lite: > tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char)’: tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’ rocksdb::DumpMallocStats(&stats_string); ^ make: * [tools/db_bench_tool.o] Error 1 * rocksdb-contrun-unity: > In file included from unity.cc:44:0: db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’: db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous auto cmp = ParsedInternalKeyComparator(icmp_); This PR will fix them Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587 Differential Revision: D10846554 Pulled By: miasantreble fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93	2018-10-24 20:16:45 -07:00
Yanqin Jin	eb8c9918f7	Remove unused variable Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4585 Differential Revision: D10841983 Pulled By: riversand963 fbshipit-source-id: 6a7e0b40065bcfbb10a2cac0cec1e8da0750a617	2018-10-24 15:51:45 -07:00
Abhishek Madan	8c78348c77	Use only "local" range tombstones during Get (#4449 ) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d	2018-10-24 12:31:12 -07:00
Zhongyi Xie	21bf7421ca	use per-level perf context for bloom filter related counters (#4581 ) Summary: PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels. This PR takes advantage of the feature to populate a few counters related to bloom filters Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581 Differential Revision: D10518010 Pulled By: miasantreble fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8	2018-10-24 12:21:38 -07:00
Neil Mayhew	43dbd4411e	Adapt three unit tests with newer compiler/libraries (#4562 ) Summary: This fixes three tests that fail with relatively recent tools and libraries: The tests are: * `spatial_db_test` * `table_test` * `db_universal_compaction_test` I'm using: * `gcc` 7.3.0 * `glibc` 2.27 * `snappy` 1.1.7 * `gflags` 2.2.1 * `zlib` 1.2.11 * `bzip2` 1.0.6.0.1 * `lz4` 1.8.2 * `jemalloc` 5.0.1 The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`. However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`: ``` db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3 
db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms) ``` I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562 Differential Revision: D10510917 Pulled By: maysamyabandeh fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189	2018-10-24 08:17:56 -07:00
Zhongyi Xie	f6b151f16d	fix clang analyzer error (#4583 ) Summary: clang analyzer currently fails with the following warnings: > db/log_reader.cc:323:9: warning: Undefined or garbage value returned to caller return r; ^~~~~~~~ db/log_reader.cc:344:11: warning: Undefined or garbage value returned to caller return r; ^~~~~~~~ db/log_reader.cc:369:11: warning: Undefined or garbage value returned to caller return r; Pull Request resolved: https://github.com/facebook/rocksdb/pull/4583 Differential Revision: D10523517 Pulled By: miasantreble fbshipit-source-id: 0cc8b8f27657b202bead148bbe7c4aa84fed095b	2018-10-23 22:14:54 -07:00
Maysam Yabandeh	c34cc40424	Fix user comparator receiving internal key (#4575 ) Summary: There was a bug in which the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting a user key but receiving an internal key when called in GenerateBottommostFiles. The patch augments an existing unit test to reproduce the bug and fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575 Differential Revision: D10500434 Pulled By: maysamyabandeh fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366	2018-10-23 08:14:46 -07:00
Siying Dong	7024263682	Dynamic level to adjust level multiplier when write is too heavy (#4338 ) Summary: Level compaction usually performs poorly when writes are so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in write-heavy cases, the level multiplier can be slightly adjusted based on the size of L0. We keep the behavior the same if the number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338 Differential Revision: D9636782 Pulled By: siying fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595	2018-10-22 10:21:47 -07:00
Yi Wu	933250e355	Fix RepeatableThreadTest::MockEnvTest hang (#4560 ) Summary: When `MockTimeEnv` is used in a test to mock time methods, we cannot use `CondVar::TimedWait` because it uses real time, not the mocked time, for the wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happens, the `wait()` method will fall into an infinite loop. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560 Differential Revision: D10472851 Pulled By: yiwu-arbug fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19	2018-10-21 20:17:18 -07:00
Yanqin Jin	da4aa59b4c	Add read retry support to log reader (#4394 ) Summary: Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`. Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394 Differential Revision: D9926508 Pulled By: riversand963 fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1	2018-10-19 11:53:00 -07:00
Maysam Yabandeh	0afa5b53d7	Disable GroupCommitTest in Appveyor (#4536 ) Summary: We have already disabled it on Travis since it has been too flaky. The same problem arises in Appveyor as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4536 Differential Revision: D10452240 Pulled By: maysamyabandeh fbshipit-source-id: 728f4ecddf780097159dc0a0737d460eb5ce4f09	2018-10-18 14:21:09 -07:00
Abhishek Madan	45f213b558	Lazily initialize RangeDelAggregator stripe map entries (#4497 ) Summary: When there are no range deletions, flush and compaction perform a binary search on an effectively empty map every time they call ShouldDelete. This PR lazily initializes each stripe map entry so that the binary search can be elided in these cases. After this PR, the total amount of time spent in compactions is 52.541331s, and the total amount of time spent in flush is 5.532608s, the former of which is a significant improvement from the results after #4495. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4497 Differential Revision: D10428610 Pulled By: abhimadan fbshipit-source-id: 6f7e1ce3698fac3ef86d1197955e6b72e0931a0f	2018-10-17 11:47:34 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: The current implementation of perf context is level agnostic, making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as tuning bloom filters. Also replaced `__thread` with the `thread_local` keyword for perf_context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
anand1976	1e3845805d	Properly determine a truncated CompactRange stop key (#4496 ) Summary: When a CompactRange() call for a level is truncated before the end key is reached, because it exceeds max_compaction_bytes, we need to properly set the compaction_end parameter to indicate the stop key. The next CompactRange will use that as the begin key. We set it to the smallest key of the next file in the level after expanding inputs to get a clean cut. Previously, we were setting it before expanding inputs. So we could end up recompacting some files. In a pathological case, where a single key has many entries spanning all the files in the level (possibly due to merge operands without a partial merge operator, thus resulting in compaction output identical to the input), this would result in an endless loop over the same set of files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496 Differential Revision: D10395026 Pulled By: anand1976 fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9	2018-10-15 23:22:51 -07:00
Yanqin Jin	e633983cf1	Add support to flush multiple CFs atomically (#4262 ) Summary: Leverage existing `FlushJob` to implement atomic flush of multiple column families. This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262 Differential Revision: D9283109 Pulled By: riversand963 fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed	2018-10-15 20:01:17 -07:00
Andrew Kryczka	32b4d4ad47	Avoid per-key linear scan over snapshots in compaction (#4495 ) Summary: `CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495 Differential Revision: D10386470 Pulled By: ajkr fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b	2018-10-15 16:21:22 -07:00
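A minimal sketch of the idea in the entry above, replacing a linear scan over an ascending list of snapshot seqnums with a binary search. The names `SequenceNumber`, `kNoSnapshot`, and `EarliestVisibleSnapshotSketch` are illustrative assumptions and are not taken from the RocksDB source; the real `findEarliestVisibleSnapshot` may differ in signature and behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <limits>
#include <vector>

// Hypothetical stand-ins for illustration only.
using SequenceNumber = std::uint64_t;
constexpr SequenceNumber kNoSnapshot =
    std::numeric_limits<SequenceNumber>::max();

// Returns the smallest snapshot seqnum >= `in`, or kNoSnapshot if there is
// none. `*prev_snapshot` receives the largest snapshot < `in`, or 0 if none.
// Requires `snapshots` to be sorted in ascending order.
SequenceNumber EarliestVisibleSnapshotSketch(
    const std::vector<SequenceNumber>& snapshots, SequenceNumber in,
    SequenceNumber* prev_snapshot) {
  assert(std::is_sorted(snapshots.begin(), snapshots.end()));
  // O(log n) instead of O(n): first element that is not less than `in`.
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), in);
  *prev_snapshot = (it == snapshots.begin()) ? 0 : *std::prev(it);
  return (it == snapshots.end()) ? kNoSnapshot : *it;
}
```

Because the lookup runs once per key, the saving grows with the number of live snapshots.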
Yanqin Jin	729a617b5b	Add listener to sample file io (#3933 ) Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires adding callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit from the class `EventListener`, as in `TestFileOperationListener`. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15	2018-10-12 18:36:11 -07:00
Yi Wu	6f8d4bdff1	Fix compile error with jemalloc (#4488 ) Summary: The "je_" prefix of jemalloc APIs is present only when the macro `JEMALLOC_NO_RENAME` from jemalloc.h is defined. With the patch I'm also adding the -DROCKSDB_JEMALLOC flag in buck TARGETS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4488 Differential Revision: D10355971 Pulled By: yiwu-arbug fbshipit-source-id: 03a2d69790a44ac89219c7525763fa937a63d95a	2018-10-12 11:50:50 -07:00
Chinmay Kamat	6422356a27	Acquire lock on DB LOCK file before starting repair. (#4435 ) Summary: This commit adds code to acquire lock on the DB LOCK file before starting the repair process. This will prevent multiple processes from performing repair on the same DB simultaneously. Fixes repair_test to work with this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4435 Differential Revision: D10361499 Pulled By: riversand963 fbshipit-source-id: 3c512c48b7193d383b2279ccecabdb660ac1cf22	2018-10-12 10:41:54 -07:00
Abhishek Madan	7dd1641048	Use vector in UncollapsedRangeDelMap (#4487 ) Summary: Using `./range_del_aggregator_bench --use_collapsed=false --num_range_tombstones=5000 --num_runs=1000`, here are the results before and after this change: Before: ``` ========================= Results: ========================= AddTombstones: 1822.61 us ShouldDelete (first): 94.5286 us ``` After: ``` ========================= Results: ========================= AddTombstones: 199.26 us ShouldDelete (first): 38.9344 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4487 Differential Revision: D10347288 Pulled By: abhimadan fbshipit-source-id: d44efe3a166d583acfdc3ec1199e0892f34dbfb7	2018-10-11 15:29:14 -07:00
UncP	531786ebf7	DBWriteImpl: remove redundant code (#4450 ) Summary: in `WriteThread::LaunchParallelMemTableWriters`, there is ` write_group->running.store(write_group->size); ` https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450 Differential Revision: D10201900 Pulled By: yiwu-arbug fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b	2018-10-10 21:00:32 -07:00
Simon Grätzer	ceded4535d	WriteBatch::Iterate wrongly returns Status::Corruption (#4478 ) Summary: When I override `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error. I don't think this check is used correctly here: the counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478 Differential Revision: D10317416 Pulled By: yiwu-arbug fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61	2018-10-10 20:57:27 -07:00
Andrew Kryczka	7e56072290	Fix merge operand reappearing when covered by DeleteRange (#4481 ) Summary: Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481 Differential Revision: D10319507 Pulled By: ajkr fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f	2018-10-10 18:16:12 -07:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: Fixes #4288. Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288, and ajkr agreed, we should also support `OnCompactionBegin`. This PR is an attempt to implement support for `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
Andrew Kryczka	faa70fc575	DeleteRange regression tests using public API (#4476 ) Summary: I wrote a couple of tests using the public API to expose/prevent the bugs we talked about. In particular, - When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key. - When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476 Differential Revision: D10286001 Pulled By: ajkr fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa	2018-10-10 12:30:11 -07:00
Abhishek Madan	9c6fea7fe1	Update HISTORY.md, fix unity_test failure (#4479 ) Summary: Follow-up to https://github.com/facebook/rocksdb/pull/4432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479 Differential Revision: D10304151 Pulled By: abhimadan fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7	2018-10-10 12:09:56 -07:00
Anand Ananthabhotla	854a4be03f	Handle mixed slowdown/no_slowdown writer properly (#4475 ) Summary: There is a bug when the write queue leader is blocked on a write delay/stop, and the queue has writers with WriteOptions::no_slowdown set to true. They are not woken up until the write stall is cleared. The fix introduces a dummy writer inserted at the tail to indicate a write stall and prevent further inserts into the queue, and a condition variable that writers who can tolerate slowdown wait on before adding themselves to the queue. The leader calls WriteThread::BeginWriteStall() to add the dummy writer and then walk the queue to fail any writers with no_slowdown set. Once the stall clears, the leader calls WriteThread::EndWriteStall() to remove the dummy writer and signal the condition variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475 Differential Revision: D10285827 Pulled By: anand1976 fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9	2018-10-09 22:52:40 -07:00
jsteemann	141ef7f8d3	avoid copying when iterating using range-based for (#4459 ) Summary: this avoids a few copies of std::string and other structs in the context of range-based for loops. instead of copying the values for each iteration, use a const reference to avoid copying. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459 Differential Revision: D10282045 Pulled By: sagar0 fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb	2018-10-09 17:15:51 -07:00
jsteemann	517d3b8b77	fix typo in error message, twice (#4457 ) Summary: Fixes a typo in error messages returned by Iterator::GetProperty(...) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457 Differential Revision: D10281965 Pulled By: sagar0 fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22	2018-10-09 17:07:27 -07:00
Abhishek Madan	3a4bd36fed	Truncate range tombstones by leveraging InternalKeys (#4432 ) Summary: To more accurately truncate range tombstones at SST boundaries, we now represent them in RangeDelAggregator using InternalKeys, which are end-key-exclusive as they were before this change. During compaction, "atomic compaction unit boundaries" (the range of keys contained in neighbouring and overlapping SSTs) are propagated down to RangeDelAggregator to truncate range tombstones at those boundaries instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683 for motivating examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432 Differential Revision: D10263952 Pulled By: abhimadan fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579	2018-10-09 15:19:38 -07:00
Zhongyi Xie	283a700f5d	add locking around calls to RecalculateWriteStallConditions in column_family_test (#4474 ) Summary: this should fix the current failing TSAN jobs: The callstack for TSAN: > WARNING: ThreadSanitizer: data race (pid=87440) Read of size 8 at 0x7d580000fce0 by thread T22 (mutexes: write M548703): #0 rocksdb::InternalStats::DumpCFStatsNoFileHistogram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1204 (column_family_test+0x00000080eca7) #1 rocksdb::InternalStats::DumpCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1169 (column_family_test+0x0000008106d0) #2 rocksdb::InternalStats::HandleCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::Slice) db/internal_stats.cc:578 (column_family_test+0x000000810720) #3 rocksdb::InternalStats::GetStringProperty(rocksdb::DBPropertyInfo const&, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:488 (column_family_test+0x00000080670c) #4 rocksdb::DBImpl::DumpStats() db/db_impl.cc:625 (column_family_test+0x00000070ce9a) > Previous write of size 8 at 0x7d580000fce0 by main thread: #0 rocksdb::InternalStats::AddCFStats(rocksdb::InternalStats::InternalCFStatsType, unsigned long) db/internal_stats.h:324 (column_family_test+0x000000693bbf) #1 rocksdb::ColumnFamilyData::RecalculateWriteStallConditions(rocksdb::MutableCFOptions const&) db/column_family.cc:818 (column_family_test+0x000000693bbf) #2 rocksdb::ColumnFamilyTest_WriteStallSingleColumnFamily_Test::TestBody() db/column_family_test.cc:2563 (column_family_test+0x0000005e5a49) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4474 Differential Revision: D10262099 Pulled By: miasantreble fbshipit-source-id: 1247973a3ca32e399b4575d3401dd5439c39efc5	2018-10-09 14:10:13 -07:00
Zhongyi Xie	cac87fcf57	move dump stats to a separate thread (#4382 ) Summary: Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all. We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior. Tested with db_bench using `--stats_dump_period_sec=20` in command line: > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG content: > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- 2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606] DB Stats Uptime(secs): 20.0 total, 20.0 interval Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s Interval stall: 00:00:0.012 H:M:S, 0.1 percent Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Pull Request resolved: 
https://github.com/facebook/rocksdb/pull/4382 Differential Revision: D9933051 Pulled By: miasantreble fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa	2018-10-08 22:54:43 -07:00
DorianZheng	27090ae8f6	Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391 ) Summary: - Fix DBImpl API race condition The timeline of execution flow is as follows: ``` timeline user_thread1 user_thread2 t1 \| cfh = GetColumnFamilyHandleUnlocked(0) t2 \| id1 = cfh->GetID() t3 \| GetColumnFamilyHandleUnlocked(1) t4 \| id2 = cfh->GetID() V ``` The original implementation returns a pointer to a stateful variable, so the returned `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with a different `column family id` - Expose ColumnFamily ID to compaction event listener - Fix the return status of `DBImpl::GetLatestSequenceForKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391 Differential Revision: D10221243 Pulled By: yiwu-arbug fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993	2018-10-08 14:24:16 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: The controller you requested could not be found. PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
DorianZheng	7487a7628c	Fix return status of DBImpl::GetLatestSequenceForKey Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4467 Differential Revision: D10241418 Pulled By: yiwu-arbug fbshipit-source-id: f6adbe7292b2c934e14971c7432b3eb115c35026	2018-10-08 14:22:05 -07:00
Maysam Yabandeh	21b51dfec4	Add inline comments to flush job (#4464 ) Summary: It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464 Differential Revision: D10224918 Pulled By: maysamyabandeh fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe	2018-10-05 15:41:17 -07:00
Maysam Yabandeh	1fb6805527	Fix snprintf buffer overflow bug (#4465 ) Summary: The contract of snprintf says that it returns "The number of characters that would have been written if n had been sufficiently large" http://www.cplusplus.com/reference/cstdio/snprintf/ The existing code however was assuming that the return value is the actual number of written bytes and uses that to reposition the starting point on the next call to snprintf. This leads to buffer overflow when the last call to snprintf has filled up the buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4465 Differential Revision: D10224080 Pulled By: maysamyabandeh fbshipit-source-id: 40f44e122d15b0db439812a0a361167cf012de3e	2018-10-05 14:50:51 -07:00
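A minimal sketch of the pitfall described in the entry above: `snprintf` returns the length that would have been written, so a running offset must be clamped before the next call. The helper name `AppendFormatted` and its signature are hypothetical and are not taken from the RocksDB patch.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: appends formatted text at buf + *pos and keeps *pos
// inside the buffer. Returns false if the output was truncated or an
// encoding error occurred.
bool AppendFormatted(char* buf, std::size_t size, std::size_t* pos,
                     const char* fmt, ...) {
  assert(size > 0 && *pos < size);
  va_list ap;
  va_start(ap, fmt);
  // vsnprintf writes at most (size - *pos - 1) characters plus a NUL, but
  // returns the length the full output would have had.
  int n = std::vsnprintf(buf + *pos, size - *pos, fmt, ap);
  va_end(ap);
  if (n < 0) {
    return false;  // encoding error; *pos is left unchanged
  }
  const std::size_t avail = size - *pos - 1;  // room excluding the NUL
  if (static_cast<std::size_t>(n) > avail) {
    *pos = size - 1;  // clamp: the buffer is full and NUL-terminated here
    return false;
  }
  *pos += static_cast<std::size_t>(n);
  return true;
}
```

Adding the raw return value to the offset instead of clamping is what lets the next write start past the end of the buffer.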
Dmitry Alimov	e13d8dcbbb	Fix typos in comments (#4456 ) Summary: Fix some typos in the comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456 Differential Revision: D10209214 Pulled By: miasantreble fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb	2018-10-04 20:46:50 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
JiYou	a1f6142f38	VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385 ) Summary: This fix is for `level == 0` in `GetOverlappingInputs()`: - In `GetOverlappingInputs()`, if `level == 0`, there is a potential risk of overflow if `i == 0`. - Optimize the process when `expand = true`; the expected complexity can be reduced to O(n). Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385 Differential Revision: D10181001 Pulled By: riversand963 fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5	2018-10-03 18:40:59 -07:00
Yanqin Jin	4e58b2ea3d	Check for compression lib support before test exec (#4443 ) Summary: Before running CompactFilesTest.SentinelCompressionType, we should check whether zlib and snappy are supported. CompactFilesTest.SentinelCompressionType is a newly added test. Compilation and linking with different options, e.g. COMPILE_WITH_TSAN, COMPILE_WITH_ASAN, etc. lead to generation of different binaries. On the one hand, it's not clear why zlib or snappy is present under ASAN, but not under TSAN. On the other hand, changing the compilation flags for TSAN or ASAN seems a bigger change worth much more attention. To unblock the cont-runs, I suggest that we simply add these two checks at the beginning of the test, as we did for GeneralTableTest.ApproximateOffsetOfCompressed in table/table_test.cc. Future actions include investigating the absence of zlib and snappy when compiling with TSAN, i.e. COMPILE_WITH_TSAN=1, if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4443 Differential Revision: D10140935 Pulled By: riversand963 fbshipit-source-id: 62f96d1e685386accd2ef0b98f6f754d3fd67b3e	2018-10-02 10:42:01 -07:00
Yanqin Jin	be5cc4c7b8	Remove a race condition between lsdir and rm (#4440 ) Summary: In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is a race condition between `DBTestBase::GetSstFileCount` and `DBImpl::PurgeObsoleteFiles`. The following graph explains why. ``` Timeline db_compact_test_t bg_flush_t bg_compact_t \| [initiate bg flush and \| start waiting] \| flush \| DeleteObsoleteFiles \| [waken up by bg_flush_t which \| signaled in DeleteObsoleteFiles] \| \| [initiate compaction and \| start waiting] \| \| [compact, \| set manual.done to true] \| [signal at the end of \| BackgroundCallFlush] \| \| [waken up by bg_flush_t \| which signaled before \| returning from \| BackgroundCallFlush] \| \| Check manual.done is true \| \| GetSstFileCount <-- race condition --> PurgeObsoleteFiles V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440 Differential Revision: D10122628 Pulled By: riversand963 fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f	2018-10-01 11:57:55 -07:00
Andrew Kryczka	ac6f435a9a	Fix CompactFiles support for kDisableCompressionOption (#4438 ) Summary: Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438 Differential Revision: D10115761 Pulled By: ajkr fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122	2018-10-01 01:18:10 -07:00
JiYou	75ca13875c	FindFile: use std::lower_bound reduce the repeated code. (#4372 ) Summary: `FindFile()` and `FindFileInRange()` actually work the same way as `std::lower_bound()`. Use `std::lower_bound()` to reduce the repeated code. - change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()` Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372 Differential Revision: D9919677 Pulled By: ajkr fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de	2018-09-27 10:35:00 -07:00
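A minimal sketch of how a hand-written binary search over sorted files can be expressed with `std::lower_bound`, as the entry above describes. The struct `FileRangeSketch`, the function `FindFileSketch`, and the use of plain `std::string` comparison are assumptions for illustration; the actual RocksDB code compares internal keys with its own comparator.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical file descriptor: only the key range matters here.
struct FileRangeSketch {
  std::string smallest;
  std::string largest;
};

// Returns the index of the first file whose largest key is >= `key`, or
// files.size() if every file ends before `key`. Assumes `files` is sorted
// by key range and non-overlapping.
std::size_t FindFileSketch(const std::vector<FileRangeSketch>& files,
                           const std::string& key) {
  auto it = std::lower_bound(
      files.begin(), files.end(), key,
      [](const FileRangeSketch& f, const std::string& k) {
        return f.largest < k;  // file lies entirely before the key
      });
  return static_cast<std::size_t>(it - files.begin());
}
```

The comparator states the "file is entirely before the key" condition once, which is the part a manual loop would otherwise repeat in each search routine.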
Yi Wu	dc813e4b85	Improve log handling when recovering without flush (#4405 ) Summary: Improve log handling when avoid_flush_during_recovery=true. 1. Restore total_log_size_ after recovery by summing up existing log sizes. Fixes #4253. 2. Truncate the last existing log, since this log can contain preallocated space and keeping that space would be a waste. This prevents a crash-looping user application from creating many logs of non-trivial size that ultimately take up all disk space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405 Differential Revision: D9953933 Pulled By: yiwu-arbug fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56	2018-09-26 10:37:48 -07:00
Nikhil Benesch	17edc82a4b	Handle tombstones at the same seqno in the CollapsedRangeDelMap (#4424 ) Summary: The CollapsedRangeDelMap was entirely mishandling tombstones at the same sequence number when the tombstones did not have identical start and end keys. Such tombstones are common since `90fc40690`, which causes tombstones to be split during compactions. For example, if the tombstone [a, c) @ 1 lies across a compaction boundary at b, it will be split into [a, b) @ 1 and [b, c) @ 1. Without this patch, the collapsed range deletion map would look like this: a -> 1 b -> 1 c -> 0 Notice how the b -> 1 entry is redundant. When the tombstones overlap, the problem is even worse. Consider tombstones [a, c) @ 1 and [b, d) @ 1, which produces this map without this patch: a -> 1 b -> 1 c -> 0 d -> 0 This map is corrupt, as a map can never contain adjacent sentinel (zero) entries. When the iterator advances from b to c, it will notice that c is a sentinel entry and skip to d, but d is also a sentinel entry! Asking what tombstone this iterator points to will trigger an assertion, as it is not pointing to a valid tombstone. /cc ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/4424 Differential Revision: D10039248 Pulled By: abhimadan fbshipit-source-id: 6d737c1e88d60e80cf27286726627ba44463e7f4	2018-09-25 14:50:31 -07:00
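The expected shape of the collapsed map in the examples above can be reproduced with a small sketch. This is not the incremental algorithm from the patch: it is a simple quadratic rebuild, chosen for clarity, that recomputes the covering sequence number at every boundary key and drops entries that repeat their predecessor. `Tombstone` and `CollapseSketch` are hypothetical names, keys are compared bytewise, and all sequence numbers are assumed to be greater than zero so that 0 can act as the sentinel.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a range tombstone covering [start, end) at seq.
struct Tombstone {
  std::string start;
  std::string end;
  uint64_t seq;  // assumed > 0; 0 is reserved for the sentinel
};

// Builds a collapsed map: each entry says which seqno covers keys from that
// entry up to the next one; a value of 0 means "nothing is deleted here".
std::map<std::string, uint64_t> CollapseSketch(
    const std::vector<Tombstone>& tombstones) {
  // Collect every boundary key of every non-empty tombstone.
  std::set<std::string> points;
  for (const auto& t : tombstones) {
    if (t.start < t.end) {
      points.insert(t.start);
      points.insert(t.end);
    }
  }
  std::map<std::string, uint64_t> result;
  uint64_t prev = 0;  // nothing is deleted before the first boundary
  for (const auto& p : points) {
    // Highest seqno among tombstones covering the interval starting at p.
    uint64_t best = 0;
    for (const auto& t : tombstones) {
      if (t.start <= p && p < t.end && t.seq > best) best = t.seq;
    }
    // Only record a boundary when the covering seqno actually changes, so
    // redundant entries (b -> 1) and adjacent sentinels never appear.
    if (best != prev) {
      result[p] = best;
      prev = best;
    }
  }
  // A non-empty collapsed map must end with the sentinel.
  assert(result.empty() || result.rbegin()->second == 0);
  return result;
}
```

For the overlapping tombstones [a, c) @ 1 and [b, d) @ 1, this sketch yields a -> 1, d -> 0, and for the split tombstones [a, b) @ 1 and [b, c) @ 1 it yields a -> 1, c -> 0.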
Abhishek Madan	3c350a7cf0	Improve RangeDelAggregator benchmarks (#4395 ) Summary: Improve time measurements for AddTombstones to only include the call and not the VectorIterator setup. Also add a new add_tombstones_per_run flag to call AddTombstones multiple times per aggregator, which will help simulate more realistic workloads. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4395 Differential Revision: D9996811 Pulled By: abhimadan fbshipit-source-id: 5865a95c323fbd9b3606493013664b4890fe5a02	2018-09-21 16:13:08 -07:00
Anand Ananthabhotla	72712f4e28	Allow dynamic modification of window size and deletion trigger (#4403 ) Summary: Make the CompactOnDeletionCollectorFactory class public, and provide methods to update the window size and deletion trigger params. These will take effect on subsequently created SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403 Differential Revision: D9976857 Pulled By: anand1976 fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48	2018-09-20 15:15:28 -07:00
Andrew Kryczka	990b52e95b	Unit test for custom comparator RangeDelAggregator (#4388 ) Summary: Add a unit test for range collapsing when non-default comparator is used. This exposes the bug fixed in #4386. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4388 Differential Revision: D9918252 Pulled By: ajkr fbshipit-source-id: 99501b96b251eab41791a7e33b27055ee36c5c39	2018-09-18 12:13:20 -07:00
jsteemann	27221b0cc2	use specified comparator in CollapsedRangeDelMap (#4386 ) Summary: The Comparator passed to CollapsedRangeDelMap was not used for operator less of the std::map `rep_` object contained in CollapsedRangeDelMap. So the map was always sorted using the default ByteWiseComparator, which seems wrong. Passing the specified Comparator through for usage in that map object fixes actual problems we were seeing with RangeDelete operations that do not delete keys as expected when using a custom Comparator. I found that the tests in current master crash when I run them locally, both with and without my patch, at the very same location. I therefore don't know if the patch breaks something else, but it seems to fix RangeDeletion issues in our product that uses RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386 Differential Revision: D9916506 Pulled By: ajkr fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66	2018-09-18 09:28:30 -07:00
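The underlying C++ point in the commit above is that a `std::map` orders its keys with `std::less` unless an ordering functor is passed in explicitly. The sketch below shows one way to thread a runtime comparator into a map. All names here (`KeyComparator`, `ReverseBytewise`, `LessByComparator`, `SeqMap`, `MakeSeqMap`) are hypothetical and are not RocksDB types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a user-supplied comparator interface.
struct KeyComparator {
  virtual ~KeyComparator() = default;
  // Returns <0, 0, or >0, like a three-way comparison.
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

// Example custom ordering: the reverse of bytewise order.
struct ReverseBytewise : KeyComparator {
  int Compare(const std::string& a, const std::string& b) const override {
    return b.compare(a);
  }
};

// Adapter that turns the three-way comparator into the strict weak
// ordering that std::map requires.
struct LessByComparator {
  const KeyComparator* cmp;
  bool operator()(const std::string& a, const std::string& b) const {
    return cmp->Compare(a, b) < 0;
  }
};

using SeqMap = std::map<std::string, uint64_t, LessByComparator>;

// The comparator must be handed to the map's constructor; declaring the
// map as std::map<std::string, uint64_t> would silently use std::less.
SeqMap MakeSeqMap(const KeyComparator* cmp) {
  assert(cmp != nullptr);
  return SeqMap(LessByComparator{cmp});
}
```

With `ReverseBytewise`, iterating the map visits "c" before "a", which is the order any range logic built on top of the map would need to see.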
Maysam Yabandeh	65ac72edd9	Fix bug in partition filters with format_version=4 (#4381 ) Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1	2018-09-17 17:28:15 -07:00
Abhishek Madan	1626f6ab6b	Add RangeDelAggregator microbenchmarks (#4363 ) Summary: To measure the results of upcoming DeleteRange v2 work, this commit adds simple benchmarks for RangeDelAggregator. It measures the average time for AddTombstones and ShouldDelete calls. Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results: Before #4014: ``` ======================= Results: ======================= AddTombstones: 1356.28 us ShouldDelete: 0.401732 us ``` Latest master: ``` ======================= Results: ======================= AddTombstones: 740.82 us ShouldDelete: 0.383271 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363 Differential Revision: D9881676 Pulled By: abhimadan fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442	2018-09-17 14:58:31 -07:00
Anand Ananthabhotla	30c21df97c	Fix regression test failures introduced by PR #4164 (#4375 ) Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640	2018-09-17 13:14:07 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to a persistent low disk space condition. 2. Errors during write callback and flush are treated as hard errors, i.e. the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurrence of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. 
b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occurred. c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM. 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Dmitri Smirnov	879998b369	Adjust c test and fix windows compilation issues Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4369 Differential Revision: D9844200 Pulled By: sagar0 fbshipit-source-id: 0d9f5f73b28234eaac55d3551ce4e2dc177af138	2018-09-14 20:57:22 -07:00
JiYou	82e8e9e26b	VersionBuilder: optimize SaveTo() to linear time. (#4366 ) Summary: Because `base_files` and `added_files` are both sorted, merging these two sorted arrays is more efficient. The complexity is reduced to linear time. - optimize the merge complexity. - move the `NDEBUG` check of sorted `added_files` out of the merge process. Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366 Differential Revision: D9833592 Pulled By: ajkr fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0	2018-09-14 19:43:04 -07:00
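The linear-time merge described in the commit above can be sketched with `std::merge`. This is an illustration only, not the `SaveTo()` code: `MergeSortedSketch` is a hypothetical name, and plain integers stand in for file metadata. The sortedness checks run once before the merge and are compiled out when `NDEBUG` is defined.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Merges two already-sorted sequences in O(base.size() + added.size()).
std::vector<int> MergeSortedSketch(const std::vector<int>& base,
                                   const std::vector<int>& added) {
  // Debug-only precondition checks, performed outside the merge loop.
  assert(std::is_sorted(base.begin(), base.end()));
  assert(std::is_sorted(added.begin(), added.end()));

  std::vector<int> out;
  out.reserve(base.size() + added.size());
  // std::merge walks both inputs once, always emitting the smaller head.
  std::merge(base.begin(), base.end(), added.begin(), added.end(),
             std::back_inserter(out));
  return out;
}
```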

3436 Commits