rocksdb

Author	SHA1	Message	Date
Yanqin Jin	229350ef48	Allow iterate refresh for secondary instance (#8700 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700 Reviewed By: zhichao-cao Differential Revision: D30523907 Pulled By: riversand963 fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91	2021-08-24 15:40:56 -07:00
Andrew Kryczka	c521f22a1e	Deflake write-prepared and write-unprepared tests (#8696 ) Summary: The `JobContext::job_snapshot` referenced DB state but could have been deleted by a BG thread after the signal/unlock allowing shutdown to proceed. Then we would see an error like this (valgrind): ``` ==354104== Thread 2: ==354104== Invalid read of size 8 ==354104== at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20) ==354104== by 0x58F5BA: operator() (unique_ptr.h:81) ==354104== by 0x58F5BA: operator() (unique_ptr.h:75) ==354104== by 0x58F5BA: ~unique_ptr (unique_ptr.h:292) ==354104== by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221) ==354104== by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696) ==354104== by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void) (db_impl_compaction_flush.cc:2468) ==354104== by 0x83707A: operator() (std_function.h:688) ==354104== by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266) ==354104== by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307) ==354104== by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28) ==354104== by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so) ==354104== by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696 Test Plan: unable to repro Reviewed By: pdillinger Differential Revision: D30505277 Pulled By: ajkr fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d	2021-08-23 23:09:17 -07:00
Jay Zhuang	249b1078c9	Add extra information to RemoteCompaction APIs (#8680 ) Summary: Currently, we only provide job_id in RemoteCompaction APIs, the main problem of `job_id` is it cannot uniquely identify a compaction job between DB instances or between sessions. Providing DB and session id to the user, which will make building cross DB compaction service easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30444859 Pulled By: jay-zhuang fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448	2021-08-23 16:27:38 -07:00
Peter Dillinger	04db764831	Embed original file number in SST table properties (#8686 ) Summary: I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add file numbers to external SST files (so that more can share db session ids for better uniqueness properties), because of forward compatibility. We would have a version of RocksDB that assumes session IDs are unique on external SST files and therefore can't really break that invariant in future files. This change adds a table property for "orig_file_number" which is populated by normal SST files and also external SST files generated by SstFileWriter. SstFileWriter now keeps a db_session_id for life of the object and increments its own file numbers for embedding in table properties. (They are arguably "fake" file numbers because these numbers and not embedded in the file name.) While updating block_based_table_builder, I removed several unnecessary fields from Rep, because following the pattern would have created another unnecessary field. This change also updates block_based_table_reader to use this new property when available, which means that for newer SST files, we can determine the stable/original <db_session_id,file_number> unique identifier using just the file contents, not the file name. (It's a bit complicated; detailed comments in block_based_table_reader.) Also added DB host id to properties listing by sst_dump, which could be useful in debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686 Test Plan: majorly overhauled StableCacheKeys test for this change Reviewed By: zhichao-cao Differential Revision: D30457742 Pulled By: pdillinger fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445	2021-08-20 20:40:48 -07:00
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	2021-08-20 18:00:16 -07:00
Merlin Mao	baf22b4ee6	Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687 ) Summary: - Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`. - Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`. Added test in `DBTest2.TraceAndReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687 Reviewed By: zhichao-cao Differential Revision: D30457630 Pulled By: autopear fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d	2021-08-20 15:35:56 -07:00
Akanksha Mahajan	5efec84c60	Fix blob callback in compaction and atomic flush (#8681 ) Summary: Pass BlobFileCompletionCallback in case of atomic flush and compaction job which is currently nullptr(default parameter). BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681 Test Plan: CircleCI jobs Reviewed By: ltamasi Differential Revision: D30445998 Pulled By: akankshamahajan15 fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021	2021-08-20 11:41:14 -07:00
Merlin Mao	ff8953380f	Add iterator's lower and upper bounds to `TraceRecord` (#8677 ) Summary: Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set. Added test cases in `DBTest2.TraceAndManualReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677 Reviewed By: zhichao-cao Differential Revision: D30438255 Pulled By: autopear fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288	2021-08-19 17:27:12 -07:00
Baptiste Lemaire	c625b8d017	Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672 ) Summary: Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables. This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge). Event listeners were added to the unit tests to handle these situations. Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672 Reviewed By: akankshamahajan15 Differential Revision: D30383109 Pulled By: bjlemaire fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73	2021-08-18 17:40:01 -07:00
Merlin Mao	d10801e983	Allow Replayer to report the results of TraceRecords. (#8657 ) Summary: `Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.) `Replayer::Replay()` reports the results via a callback function. New interface: `TraceRecordResult` in "rocksdb/trace_record_result.h". `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657 Reviewed By: ajkr Differential Revision: D30290216 Pulled By: autopear fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6	2021-08-18 17:06:14 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Yanqin Jin	2b367fa8cc	Fix bug caused by releasing snapshot(s) during compaction (#8608 ) Summary: In debug mode, we are seeing assertion failure as follows ``` db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \ Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed. ``` It is caused by releasing earliest snapshot during compaction between the execution of `NextFromInput()` and `PrepareOutput()`. In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`, incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened. Fix by taking a more conservative approach: make compaction zero out sequence number only if key is in the earliest snapshot when the compaction starts. Another assertion failure is ``` Assertion `current_user_key_snapshot_ == last_snapshot' failed. ``` It's caused by releasing the snapshot between the PUT and SingleDelete during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30145645 Pulled By: riversand963 fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81	2021-08-17 22:14:20 -07:00
Levi Tamasi	6878cedcc3	Add statistics support to integrated BlobDB (#8667 ) Summary: The patch adds statistics support to the integrated BlobDB implementation, namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms `BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like `BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`, `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.) Note that the vast majority of the old BlobDB's tickers/histograms are not really applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in `statistics.h`. Fixes https://github.com/facebook/rocksdb/issues/8645 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667 Test Plan: Ran `make check` and tested the new statistics using `db_bench`. Reviewed By: riversand963 Differential Revision: D30356884 Pulled By: ltamasi fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4	2021-08-17 17:22:31 -07:00
Peter Dillinger	a207c27809	Stable cache keys using DB session ids in SSTs (#8659 ) Summary: Use DB session ids in SST table properties to make cache keys stable across DB re-open and copy / move / restore / etc. These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Relevant to https://github.com/facebook/rocksdb/issues/7405 This change has a minor regression in PersistentCache functionality: metaindex blocks are no longer cached in PersistentCache. Table properties blocks already were not but ideally should be. I didn't spent effort to fix & test these issues because we don't believe PersistentCache is used much if at all and expect SecondaryCache to replace it. (Though PRs are welcome.) FIXME: there is more to be fixed for stable cache keys on external SST files Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659 Test Plan: new unit test added, which fails when disabling new functionality Reviewed By: zhichao-cao Differential Revision: D30297705 Pulled By: pdillinger fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a	2021-08-16 20:37:20 -07:00
Adam Retter	5de333fd99	Add db_test2 to to ASSERT_STATUS_CHECKED (#8640 ) Summary: This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640 Reviewed By: akankshamahajan15 Differential Revision: D30303684 Pulled By: mrambacher fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a	2021-08-16 08:10:32 -07:00
Jay Zhuang	c55460c734	Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644 Reviewed By: siying, zhichao-cao Differential Revision: D30236535 Pulled By: jay-zhuang fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81	2021-08-15 14:17:45 -07:00
Baptiste Lemaire	e51be2c5a1	Improve MemPurge sampling (#8656 ) Summary: Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details). In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID. I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction). Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656 Reviewed By: pdillinger Differential Revision: D30288583 Pulled By: bjlemaire fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3	2021-08-13 14:35:41 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Levi Tamasi	f63331ebaf	Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627 ) Summary: The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog` by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file. This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30147002 Pulled By: ltamasi fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6	2021-08-10 11:10:07 -07:00
Jay Zhuang	61f83dfeb7	Add an unittest for tiered storage universal compaction (#8631 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631 Reviewed By: siying Differential Revision: D30200385 Pulled By: jay-zhuang fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55	2021-08-09 13:44:23 -07:00
sdong	e7c24168d8	Move old files to warm tier in FIFO compactions (#8310 ) Summary: Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310 Test Plan: Add several unit tests. Reviewed By: ajkr Differential Revision: D28493792 fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501	2021-08-09 12:51:14 -07:00
Roy Crihfield	d4b75d295f	Add more C bindings for OptimisticTransactionDB (#8526 ) Summary: * `rocksdb_optimistictransactiondb_checkpoint_object_create` * `rocksdb_optimistictransactiondb_write` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526 Reviewed By: ajkr Differential Revision: D30076822 Pulled By: jay-zhuang fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d	2021-08-06 19:10:48 -07:00
Levi Tamasi	87882736ef	Fix the sorting of KeyContexts for batched MultiGet (#8633 ) Summary: `CompareKeyContext::operator()` on the trunk has a bug: when comparing column family IDs, `lhs` is used for both sides of the comparison. This results in the `KeyContext`s getting sorted solely based on key, which in turn means that keys with the same column family do not necessarily form a single range in the sorted list. This violates an assumption of the batched `MultiGet` logic, leading to the same column family showing up multiple times in the list of `MultiGetColumnFamilyData`. The end result is the code attempting to check out the thread-local `SuperVersion` for the same CF multiple times, causing an assertion violation in debug builds and memory corruption/crash in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30169182 Pulled By: ltamasi fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023	2021-08-06 16:27:42 -07:00
Zaorang Yang	e95c570047	Fix the wrong comment of level compaction cf paths test (#8533 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8533 Reviewed By: ajkr Differential Revision: D29718067 fbshipit-source-id: b4b91c9271362e7a7d47ddbaf28f56fb537cc668	2021-08-06 15:27:12 -07:00
mrambacher	d057e8326d	Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481 ) Summary: - Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes. - Added Options/Configurable/Object Registration for TTL and Cassandra variants - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option - Added tests for new functionality Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481 Reviewed By: zhichao-cao Differential Revision: D30136050 Pulled By: mrambacher fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af	2021-08-06 08:27:25 -07:00
Akanksha Mahajan	fd2079938d	Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620 ) Summary: Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620 Test Plan: Added new unit test Reviewed By: anand1976 Differential Revision: D30091319 Pulled By: akankshamahajan15 fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c	2021-08-05 19:44:51 -07:00
Levi Tamasi	9b25d26dc8	Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624 ) Summary: We've been seeing occasional crashes on CI while inserting into the vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes don't reproduce locally (could be either a race or an object lifecycle issue) but the good news is that the vectors in question are not really used for anything meaningful by the test. (The assertion about the sizes of the two vectors being equal is guaranteed to hold, since the two sync points where they are populated are right after each other.) The patch simply removes the vectors from the test, alongside the associated callbacks and sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30118485 Pulled By: ltamasi fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647	2021-08-05 18:36:16 -07:00
Andrew Kryczka	a685a701ca	Do not attempt to rename non-existent info log (#8622 ) Summary: Previously we attempted to rename "LOG" to "LOG.old.*" without checking its existence first. "LOG" had no reason to exist in a new DB. Errors in renaming a non-existent "LOG" were swallowed via `PermitUncheckedError()` so things worked. However the storage service's error monitoring was detecting all these benign rename failures. So it is better to fix it. Also with this PR we can now distinguish rename failure for other reasons and return them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622 Test Plan: new unit test Reviewed By: akankshamahajan15 Differential Revision: D30115189 Pulled By: ajkr fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170	2021-08-04 17:25:00 -07:00
Yanqin Jin	0879c24040	Fix NotifyOnFlushCompleted() for atomic flush (#8585 ) Summary: PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure `OnFlushCompleted()` is called after committing flush results to MANIFEST. However, `flush_jobs_info_` is not updated in atomic flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`. This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D29913720 Pulled By: riversand963 fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda	2021-08-03 13:31:10 -07:00
Akanksha Mahajan	8b2f60b668	Cache warming blocks during flush (#8561 ) Summary: Insert warm blocks (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561 Test Plan: Added unit test Reviewed By: anand1976 Differential Revision: D29773411 Pulled By: akankshamahajan15 fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66	2021-08-03 12:44:15 -07:00
Baptiste Lemaire	b278152261	Fix db stress crash mempurge (#8604 ) Summary: The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush. It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue. In this PR, I reverted the MemtableList::Add function to what it was before my changes. I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from. I also run `db_crashtest.py whitebox` and `blackbox` for a couple hours locally before committing this PR. Experiment run: ```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604 Reviewed By: pdillinger Differential Revision: D30047295 Pulled By: bjlemaire fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe	2021-08-02 20:26:35 -07:00
Levi Tamasi	3f7e929865	Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605 ) Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528	2021-08-02 18:12:11 -07:00
yangzaorang	8e91bd90d2	Fix a issue with initializing blob header buffer (#8537 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8537 Reviewed By: ajkr Differential Revision: D29838132 Pulled By: jay-zhuang fbshipit-source-id: e3e78d5f85f240a1800ace417a8b634f74488e41	2021-08-02 17:15:06 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	2021-07-30 12:07:49 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
sdong	e8f218cb68	DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591 ) Summary: If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D29969412 fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de	2021-07-29 11:51:08 -07:00
Peter Dillinger	0804b44fb6	Some fixes and enhancements to `ldb repair` (#8544 ) Summary: * Basic handling of SST file with just range tombstones rather than failing assertion about smallest_seqno <= largest_seqno * Adds --verbose option so that there exists a way to see the INFO output from Repairer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544 Test Plan: unit test added, manual testing for --verbose Reviewed By: ajkr Differential Revision: D29954805 Pulled By: pdillinger fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9	2021-07-28 16:44:14 -07:00
jimmycleary	e0ff365a76	Replace macros in compaction_iterator.cc with inline functions (#8592 ) Summary: Internal task T96186510. Created new inline member functions in `CompactionIterator`, `DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and `InEarliestSnapshot` to replace the macros at the top of `compaction_iterator.cc`. Placed the definitions in `compaction_iterator.h` in accordance with Google's style guide for inline functions. Separated the declarations and definitions, and only placed the `inline` keyword on the definitions, in line with ISO CPP recommendations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592 Test Plan: Ran `make check`. Successful build and all tests appeared to pass. Reviewed By: riversand963 Differential Revision: D29966782 Pulled By: jimmycFB fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6	2021-07-28 14:53:29 -07:00
Peter Dillinger	74b7c0d249	Fix use-after-free on implicit temporary FileOptions (#8571 ) Summary: FileOptions has an implicit conversion from EnvOptions and some internal APIs take `const FileOptions&` and save the reference, which is counter to Google C++ guidelines, > Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements. This is at least a problem for repair.cc, which passes an EnvOptions to TableCache(), which would save a reference to the temporary copy as FileOptions. This was unfortunately only caught as a side effect of changes in https://github.com/facebook/rocksdb/issues/8544. This change fixes the repair.cc case and updates the involved internal APIs that save a reference to use `const FileOptions*` instead. Unfortunately, I don't know how to get any of our sanitizers to reliably report bugs like this, so I can't rule out more existing in our codebase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571 Test Plan: Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on AWS EC2) Reviewed By: ajkr Differential Revision: D29943890 Pulled By: pdillinger fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9	2021-07-27 21:49:14 -07:00
Peter Dillinger	e352bd5742	Fix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589 ) Summary: This appears to be little used code so not a major bug, but is blocking https://github.com/facebook/rocksdb/issues/8544 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589 Test Plan: Added regression test to the end of DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports memory leak. Reviewed By: ajkr Differential Revision: D29943623 Pulled By: pdillinger fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b	2021-07-27 21:32:11 -07:00
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	2021-07-27 07:47:02 -07:00
Baptiste Lemaire	4361d6d163	Add simple heuristics for experimental mempurge. (#8583 ) Summary: Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE. `ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput). `ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios. This PR also includes support for `experimental_mempurge_policy` in `db_bench`. Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583 Reviewed By: pdillinger Differential Revision: D29888050 Pulled By: bjlemaire fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1	2021-07-26 11:56:29 -07:00
leipeng	4171e3db9b	CompactionJob::Install(): fix log truncation (#8563 ) Summary: event log info may be truncated, the default buffer size is 512, this PR changes buffer size to 8192. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8563 Reviewed By: ajkr Differential Revision: D29838229 Pulled By: jay-zhuang fbshipit-source-id: 00c5dea3caff0641a209f02c972e92d65b505f50	2021-07-23 11:39:24 -07:00
Drewryz	3b27725245	Fix a minor issue with initializing the test path (#8555 ) Summary: The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555 Reviewed By: ajkr Differential Revision: D29758399 Pulled By: jay-zhuang fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011	2021-07-23 08:38:45 -07:00
Baptiste Lemaire	c521a9ab2b	Retire superfluous functions introduced in earlier mempurge PRs. (#8558 ) Summary: The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437	2021-07-22 18:29:13 -07:00
Yanqin Jin	2e5388178f	Return error if trying to open secondary on missing or inaccessible primary (#8200 ) Summary: If the primary's CURRENT file is missing or inaccessible, the secondary should not hang trying repeatedly to switch to the next MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27840627 Pulled By: riversand963 fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1	2021-07-22 15:48:58 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try avoid expensive updating options operation if `SetDBOptions()` does not change any option value. Skip updating is not guaranteed, for example, changing `bytes_per_sync` to `0` may still trigger updating, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	2021-07-21 13:45:59 -07:00

1 2 3 4 5 ...

4561 Commits