rocksdb

Author	SHA1	Message	Date
Akanksha Mahajan	5efec84c60	Fix blob callback in compaction and atomic flush (#8681 ) Summary: Pass BlobFileCompletionCallback in case of atomic flush and compaction job which is currently nullptr(default parameter). BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681 Test Plan: CircleCI jobs Reviewed By: ltamasi Differential Revision: D30445998 Pulled By: akankshamahajan15 fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021	2021-08-20 11:41:14 -07:00
Baptiste Lemaire	c625b8d017	Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672 ) Summary: Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables. This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge). Event listeners were added to the unit tests to handle these situations. Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672 Reviewed By: akankshamahajan15 Differential Revision: D30383109 Pulled By: bjlemaire fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73	2021-08-18 17:40:01 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Adam Retter	5de333fd99	Add db_test2 to to ASSERT_STATUS_CHECKED (#8640 ) Summary: This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640 Reviewed By: akankshamahajan15 Differential Revision: D30303684 Pulled By: mrambacher fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a	2021-08-16 08:10:32 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Levi Tamasi	87882736ef	Fix the sorting of KeyContexts for batched MultiGet (#8633 ) Summary: `CompareKeyContext::operator()` on the trunk has a bug: when comparing column family IDs, `lhs` is used for both sides of the comparison. This results in the `KeyContext`s getting sorted solely based on key, which in turn means that keys with the same column family do not necessarily form a single range in the sorted list. This violates an assumption of the batched `MultiGet` logic, leading to the same column family showing up multiple times in the list of `MultiGetColumnFamilyData`. The end result is the code attempting to check out the thread-local `SuperVersion` for the same CF multiple times, causing an assertion violation in debug builds and memory corruption/crash in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30169182 Pulled By: ltamasi fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023	2021-08-06 16:27:42 -07:00
Levi Tamasi	9b25d26dc8	Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624 ) Summary: We've been seeing occasional crashes on CI while inserting into the vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes don't reproduce locally (could be either a race or an object lifecycle issue) but the good news is that the vectors in question are not really used for anything meaningful by the test. (The assertion about the sizes of the two vectors being equal is guaranteed to hold, since the two sync points where they are populated are right after each other.) The patch simply removes the vectors from the test, alongside the associated callbacks and sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30118485 Pulled By: ltamasi fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647	2021-08-05 18:36:16 -07:00
Yanqin Jin	0879c24040	Fix NotifyOnFlushCompleted() for atomic flush (#8585 ) Summary: PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure `OnFlushCompleted()` is called after committing flush results to MANIFEST. However, `flush_jobs_info_` is not updated in atomic flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`. This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D29913720 Pulled By: riversand963 fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda	2021-08-03 13:31:10 -07:00
Levi Tamasi	3f7e929865	Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605 ) Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528	2021-08-02 18:12:11 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	2021-07-30 12:07:49 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
sdong	e8f218cb68	DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591 ) Summary: If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D29969412 fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de	2021-07-29 11:51:08 -07:00
Baptiste Lemaire	c521a9ab2b	Retire superfluous functions introduced in earlier mempurge PRs. (#8558 ) Summary: The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437	2021-07-22 18:29:13 -07:00
Yanqin Jin	2e5388178f	Return error if trying to open secondary on missing or inaccessible primary (#8200 ) Summary: If the primary's CURRENT file is missing or inaccessible, the secondary should not hang trying repeatedly to switch to the next MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27840627 Pulled By: riversand963 fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1	2021-07-22 15:48:58 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try avoid expensive updating options operation if `SetDBOptions()` does not change any option value. Skip updating is not guaranteed, for example, changing `bytes_per_sync` to `0` may still trigger updating, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	2021-07-21 13:45:59 -07:00
sdong	9e885939a3	Change to code for trimmed memtable history is to released outside DB mutex (#8530 ) Summary: Currently, the code shows that we delete memtables immedately after it is trimmed from history. Although it should never happen as the super version still holds the memtable, which is only switched after it, it feels a good practice not to do it, but use clean it up in the standard way: put it to WriteContext and clean it after DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D29703410 fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4	2021-07-16 19:28:48 -07:00
Peter Dillinger	df5dc73bec	Don't hold DB mutex for block cache entry stat scans (#8538 ) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b	2021-07-16 14:13:08 -07:00
Baptiste Lemaire	206845c057	Mempurge support for wal (#8528 ) Summary: In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables. The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables. This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528 Reviewed By: pdillinger Differential Revision: D29701097 Pulled By: bjlemaire fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9	2021-07-15 17:49:13 -07:00
longlijian	803a40d412	Delete legacy code not used any more. (#8508 ) Summary: The removed function in this PR, just only have declared and dose not have any reference used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508 Reviewed By: mrambacher Differential Revision: D29649033 Pulled By: jay-zhuang fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d	2021-07-14 16:04:56 -07:00
Dmitry Vorobev	0b75b22321	Implement missing Handler methods in ColumnFamilyCollector. (#8456 ) Summary: When db is open as secondary, there are basically 2 step process: 1) Collect column families from wal log 2) Apply changes to Memtable In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open). Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456 Reviewed By: ajkr Differential Revision: D29455945 Pulled By: mrambacher fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f	2021-07-12 09:23:45 -07:00
anand76	d1b70b05a6	Avoid passing existing BG error to WriteStatusCheck (#8511 ) Summary: In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511 Test Plan: Run existing tests Reviewed By: zhichao-cao Differential Revision: D29639917 Pulled By: anand1976 fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182	2021-07-11 22:37:52 -07:00
Baptiste Lemaire	837705ad80	Make mempurge a background process (equivalent to in-memory compaction). (#8505 ) Summary: In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype. In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...). Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge. MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`. The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505 Reviewed By: pdillinger Differential Revision: D29619283 Pulled By: bjlemaire fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95	2021-07-09 17:23:59 -07:00
longlijian	ac3f3f3719	Eliminate compiler complaining, which the return type of the function… (#8498 ) Summary: … should be uint64_t. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8498 Reviewed By: jay-zhuang Differential Revision: D29605064 Pulled By: ajkr fbshipit-source-id: e431448ac9d8a37ae83679c4cc5732e29fe49de4	2021-07-08 10:09:05 -07:00
Baptiste Lemaire	714ce5041d	Fix clang_analyzer failure (#8492 ) Summary: Previously, the following command: ```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze``` was raising an error/warning the new_mem could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, a `new_mem==nullptr` was a possible scenario. This code patch simply introduces `assert`s and removes this condition in the `if`-statement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492 Reviewed By: jay-zhuang Differential Revision: D29571275 Pulled By: bjlemaire fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea	2021-07-06 18:48:56 -07:00
Baptiste Lemaire	9dc887ece0	Memtable "MemPurge" prototype (#8454 ) Summary: Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage. The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time . Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested). One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454 Reviewed By: anand1976 Differential Revision: D29433971 Pulled By: bjlemaire fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5	2021-07-02 05:23:02 -07:00
Akanksha Mahajan	c76778e2bd	Call OnCompactionCompleted API in case of DisableManualCompaction (#8469 ) Summary: Call OnCompactionCompleted API in case of DisableManualCompaction() with updated Status::Incomplete Pull Request resolved: https://github.com/facebook/rocksdb/pull/8469 Reviewed By: ajkr Differential Revision: D29475517 Pulled By: akankshamahajan15 fbshipit-source-id: a1726c5e6ee18c0b5097ea04f5e6975fbe108055	2021-07-01 19:18:55 -07:00
Jay Zhuang	93a7389442	Add statistics support on CompactionService remote side (#8368 ) Summary: Add statistics option on CompactionService remote side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8368 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28944427 Pulled By: jay-zhuang fbshipit-source-id: 2a19217f4a69b6e511af87eed12391860ef00c5e	2021-06-29 11:48:14 -07:00
mrambacher	be219089ad	Add BlobMetaData retrieval methods (#8273 ) Summary: Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it. First pass at struct. More tests and maybe fields to come... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273 Reviewed By: ltamasi Differential Revision: D29102400 Pulled By: mrambacher fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73	2021-06-28 08:13:29 -07:00
Zhichao Cao	a904c62d28	Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412 ) Summary: In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead. In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412 Test Plan: make check, add new testing cases. Reviewed By: anand1976 Differential Revision: D29151545 Pulled By: zhichao-cao fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772	2021-06-25 00:47:17 -07:00
Jay Zhuang	54d73d6429	Fix DeleteFilesInRange may cause inconsistent compaction error (#8434 ) Summary: `DeleteFilesInRange()` marks deleting files to `being_compacted` before deleting, which may cause ongoing compactions report corruption exception or ASSERT for debug build. Adding the missing `ComputeCompactionScore()` when `being_compacted` is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434 Test Plan: Unittest Reviewed By: ajkr Differential Revision: D29276127 Pulled By: jay-zhuang fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7	2021-06-22 09:17:37 -07:00
Zhichao Cao	82a70e1470	Trace MultiGet Keys and CF_IDs to the trace file (#8421 ) Summary: Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421 Test Plan: make check, add test to trace_analyzer_test Reviewed By: anand1976 Differential Revision: D29221195 Pulled By: zhichao-cao fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff	2021-06-18 15:04:05 -07:00
anand76	575ea26ec9	Don't log a warning if file system doesn't support ReopenWritableFile() (#8414 ) Summary: RocksDB logs a warning if WAL truncation on DB open fails. Its possible that on some file systems, truncation is not required and they would return ```Status::NotSupported()``` for ```ReopenWritableFile```. Don't log a warning in such cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8414 Reviewed By: akankshamahajan15 Differential Revision: D29181738 Pulled By: anand1976 fbshipit-source-id: 6e01e9117e1e4c1d67daa4dcee7fa59d06e057a7	2021-06-17 12:05:40 -07:00
mrambacher	d5bd0039b9	Rename ImmutableOptions variables (#8409 ) Summary: This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that had did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type. This change simply renames the variables to match the current type. No new functionality is introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409 Reviewed By: pdillinger Differential Revision: D29166248 Pulled By: mrambacher fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f	2021-06-16 16:51:38 -07:00
Peter Dillinger	b3dbeadc34	Fix double-dumping CF stats to log (#8380 ) Summary: DBImpl::DumpStats is supposed to do this: Dump DB stats to LOG For each CF, dump CFStatsNoFileHistogram to LOG For each CF, dump CFFileHistogram to LOG Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram, in both loops, resulting in near-duplicate output. This fixes the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380 Test Plan: Manual inspection of LOG after db_bench Reviewed By: jay-zhuang Differential Revision: D29017535 Pulled By: pdillinger fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6	2021-06-11 17:06:09 -07:00
Zhichao Cao	f44e69c64a	Use DbSessionId as cache key prefix when secondary cache is enabled (#8360 ) Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814	2021-06-10 11:02:43 -07:00
David Devecsery	80a59a03a7	Cancel compact range (#8351 ) Summary: Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure. Tested via two tests added to db_tests2.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351 Reviewed By: ajkr Differential Revision: D28808894 Pulled By: ddevec fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd	2021-06-07 11:41:31 -07:00
Peter (Stig) Edwards	c75ef03e58	Do not truncate WAL if in read_only mode (#8313 ) Summary: I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...``` Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```). I think introduced by `7d7f14480e` from https://github.com/facebook/rocksdb/pull/8122 I added a test to catch the WAL file being truncated and the modification time on it changing. I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s. The test could also check the set of files is the same and that the sizes are also unchanged. Before: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged db/db_basic_test.cc:182: Failure Expected equality of these values: file_mtime_after_readonly_reopen Which is: 1621611136 file_mtime_before_readonly_reopen Which is: 1621611135 file is: 000010.log [ FAILED ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` After: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged [ OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313 Reviewed By: pdillinger Differential Revision: D28656925 Pulled By: jay-zhuang fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6	2021-05-27 10:27:55 -07:00
Jay Zhuang	3786181a90	Add remote compaction public API (#8300 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300 Reviewed By: ajkr Differential Revision: D28464726 Pulled By: jay-zhuang fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e	2021-05-19 21:41:31 -07:00
mrambacher	8948dc8524	Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262 ) Summary: The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form). Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes. Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262 Reviewed By: pdillinger Differential Revision: D28226540 Pulled By: mrambacher fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf	2021-05-05 14:00:17 -07:00
Andrew Kryczka	c70bae1b05	Fix ConcurrentTaskLimiter token release for shutdown (#8253 ) Summary: Previously the shutdown process did not properly wait for all `compaction_thread_limiter` tokens to be released before proceeding to delete the DB's C++ objects. When this happened, we saw tests like "DBCompactionTest.CompactionLimiter" flake with the following error: ``` virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. ``` There is a case where a token can still be alive even after the shutdown process has waited for BG work to complete. In particular, this happens because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all reach zero. These counters are decremented in `BackgroundCallCompaction()` functions. However, tokens are released in `BGWorkCompaction()` functions, which actually wrap the `BackgroundCallCompaction()` function. A simple sleep could repro the race condition: ``` $ diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 806bc548a..ba59efa89 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void arg) { static_cast<PrepickedCompaction*>(ca.prepicked_compaction); static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction( prepicked_compaction, Env::Priority::LOW); + sleep(1); delete prepicked_compaction; } $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. Received signal 6 (Aborted) #0 /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ?? ??:0 https://github.com/facebook/rocksdb/issues/1 /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ?? ??:0 ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253 Test Plan: sleeps to expose race conditions Reviewed By: akankshamahajan15 Differential Revision: D28168064 Pulled By: ajkr fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160	2021-05-04 17:27:24 -07:00
Peter Dillinger	d2ca04e3ed	Add more LSM info to FilterBuildingContext (#8246 ) Summary: Add `num_levels`, `is_bottommost`, and table file creation `reason` to `FilterBuildingContext`, in anticipation of more powerful Bloom-like filter support. To support this, added `is_bottommost` and `reason` to `TableBuilderOptions`, which allowed removing `reason` parameter from `rocksdb::BuildTable`. I attempted to remove `skip_filters` from `TableBuilderOptions`, because filter construction decisions should arise from options, not one-off parameters. I could not completely remove it because the public API for SstFileWriter takes a `skip_filters` parameter, and translating this into an option change would mean awkwardly replacing the table_factory if it is BlockBasedTableFactory with new filter_policy=nullptr option. I marked this public skip_filters option as deprecated because of this oddity. (skip_filters on the read side probably makes sense.) At least `skip_filters` is now largely hidden for users of `TableBuilderOptions` and is no longer used for implementing the optimize_filters_for_hits option. Bringing the logic for that option closer to handling of FilterBuildingContext makes it more obvious that hese two are using the same notion of "bottommost." (Planned: configuration options for Bloom-like filters that generalize `optimize_filters_for_hits`) Recommended follow-up: Try to get away from "bottommost level" naming of things, which is inaccurate (see VersionStorageInfo::RangeMightExistAfterSortedRun), and move to "bottommost run" or just "bottommost." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246 Test Plan: extended an existing unit test to exercise and check various filter building contexts. Also, existing tests for optimize_filters_for_hits validate some of the "bottommost" handling, which is now closely connected to FilterBuildingContext::is_bottommost through TableBuilderOptions::is_bottommost Reviewed By: mrambacher Differential Revision: D28099346 Pulled By: pdillinger fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a	2021-04-30 13:50:13 -07:00
Peter Dillinger	85becd94c1	Refactor: use TableBuilderOptions to reduce parameter lists (#8240 ) Summary: Greatly reduced the not-quite-copy-paste giant parameter lists of rocksdb::NewTableBuilder, rocksdb::BuildTable, BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor. Moved weird separate parameter `uint32_t column_family_id` of TableFactory::NewTableBuilder into TableBuilderOptions. Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t target_file_size` is not randomly placed between uint64_t timestamps (was easy to mix up). Replaced a couple of fields of BlockBasedTableBuilder::Rep with a FilterBuildingContext. The motivation for this change is making it easier to pass along more data into new fields in FilterBuildingContext (follow-up PR). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240 Test Plan: ASAN make check Reviewed By: mrambacher Differential Revision: D28075891 Pulled By: pdillinger fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f	2021-04-29 07:00:50 -07:00
mrambacher	01e460d538	Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176 ) Summary: This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively. readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release). There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176 Reviewed By: pdillinger Differential Revision: D27954339 Pulled By: mrambacher fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad	2021-04-22 20:43:54 -07:00
Jay Zhuang	f0fca2b1d5	Add internal compaction API for Secondary instance (#8171 ) Summary: Add compaction API for secondary instance, which compact the files to a secondary DB path without installing to the LSM tree. The API will be used to remote compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8171 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D27694545 Pulled By: jay-zhuang fbshipit-source-id: 8ff3ec1bffdb2e1becee994918850c8902caf731	2021-04-22 13:02:28 -07:00
Zhichao Cao	09a9ec3ac0	Fix the false positive alert of CF consistency check in WAL recovery (#8207 ) Summary: In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here. Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e	2021-04-22 10:28:37 -07:00
sdong	4985cea141	Add comment to DisableManualCompaction() (#8186 ) Summary: Add comment to DisableManualCompaction() which was missing. Also explictly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186 Test Plan: Run existing unit tests. Reviewed By: jay-zhuang Differential Revision: D27744517 fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71	2021-04-21 15:23:46 -07:00
Akanksha Mahajan	596e9008e4	Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898 ) Summary: When WriteBufferManager is shared across DBs and column families to maintain memory usage under a limit, OOMs have been observed when flush cannot finish but writes continuously insert to memtables. In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write, this change will stall incoming writers until flush is completed and memory_usage drops. Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_ (memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true. DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall() (which adds dummy stall object to the writer's queue). Then DB is blocked on a state State::Blocked (current write doesn't go through). WBStallInterface object maintained by every DB instance is added to the queue of WriteBufferManager. If multiple DBs tries to write during this stall, they will also be blocked when check WriteBufferManager::ShouldStall() returns true. End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush to complete and avoid continous stalling if memory usage remains close to buffer_size. WriterBufferManager::EndWriteStall() is called, which removes all instances from its queue and signal them to continue. Their state is changed to State::Running and they are unblocked. DBImpl then signal all incoming writers of that DB to continue by calling write_thread_.EndWriteStall() (which removes dummy stall object from the queue). DB instance creates WBMStallInterface which is an interface to block and signal DBs during stall. When DB needs to be blocked or signalled by WriteBufferManager, state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898 Test Plan: Added a new test db/db_write_buffer_manager_test.cc Reviewed By: anand1976 Differential Revision: D26093227 Pulled By: akankshamahajan15 fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae	2021-04-21 13:54:02 -07:00
Yanqin Jin	a376c22066	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	2021-04-19 18:11:13 -07:00
Yanqin Jin	b1f62be10e	Use the right level (L0) for files written during WAL recovery (#8187 ) Summary: As the name of `DBImpl::WriteLevel0TableForRecovery` suggests, the resulting table file should be placed on L0. However, the argument `level` passed to `BuildTable()` is -1. We need to correct this since the level information will be useful to determine file placement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8187 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27748570 Pulled By: riversand963 fbshipit-source-id: e1cd23128a8de31f14b1edc2ea92754c154e4f10	2021-04-14 23:40:22 -07:00
Giuseppe Ottaviano	48cd7a3aae	Fix flush reason attribution (#8150 ) Summary: Current flush reason attribution is misleading or incorrect (depending on what the original intention was): - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager` - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull` This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150 Reviewed By: zhichao-cao Differential Revision: D27569645 Pulled By: ot fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc	2021-04-07 23:18:37 -07:00
Peter Dillinger	a4e82a3cca	Fix read-only DB writing to filesystem with write_dbid_to_manifest (#8164 ) Summary: Fixing another crash test failure in the case of write_dbid_to_manifest=true and reading a backup as read-only DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164 Test Plan: enhanced unit test for backup as read-only DB, ran blackbox_crash_test more with elevated backup_one_in Reviewed By: zhichao-cao Differential Revision: D27622237 Pulled By: pdillinger fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb	2021-04-07 10:26:47 -07:00
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	2021-04-06 14:37:53 -07:00
Akanksha Mahajan	689b13e639	Add request_id in IODebugContext. (#8045 ) Summary: Add request_id in IODebugContext which will be populated by underlying FileSystem for IOTracing purposes. Update IOTracer to trace request_id in the tracing records. Provided API IODebugContext::SetRequestId which will set the request_id and enable tracing for request_id. The API hides the implementation and underlying file system needs to call this API directly. Update DB::StartIOTrace API and remove redundant Env* from the argument as its not used and DB already has Env that is passed down to IOTracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045 Test Plan: Update unit test. Differential Revision: D26899871 Pulled By: akankshamahajan15 fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b	2021-04-01 13:14:51 -07:00
sherriiiliu	e6534900bd	Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125 ) Summary: In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_ and bg_flush_scheduled_ to drop to 0. Unschedule is called prior to cancel any unscheduled flushes/compactions. It is assumed that anything in the high priority is a flush, and anything in the low priority pool is a compaction. This assumption, however, is broken when the high-pri pool is full. As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_ will remain > 0 and DB can be in hang state. The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_` inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB `mutex_` will make the counts atomic in `Unschedule`. Related discussion: https://github.com/facebook/rocksdb/issues/7928 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125 Test Plan: Added new test case which hangs without the fix. Reviewed By: jay-zhuang Differential Revision: D27390043 Pulled By: ajkr fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110	2021-03-30 18:35:20 -07:00
anand76	7d7f14480e	Always truncate the latest WAL file on DB Open (#8122 ) Summary: Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed. Tests: Add unit tests to db_wal_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122 Reviewed By: riversand963 Differential Revision: D27366132 Pulled By: anand1976 fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e	2021-03-28 10:00:08 -07:00
Zhichao Cao	af80a78ba4	Fix flush no wal IO error bug (#8107 ) Summary: There is bug in the current code base introduced in https://github.com/facebook/rocksdb/issues/8049 , we still set the SST file write IO Error only case as hard error. Fix it by removing the logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8107 Test Plan: make check, error_handler_fs_test Reviewed By: anand1976 Differential Revision: D27321422 Pulled By: zhichao-cao fbshipit-source-id: c014afc1553ca66b655e3bbf9d0bf6eb417ccf94	2021-03-25 21:42:50 -07:00
Andrew Kryczka	c20a7cd6c7	Apply `sample_for_compression` to all block-based tables (#8105 ) Summary: Previously it only applied to block-based tables generated by flush. This restriction was undocumented and blocked a new use case. Now compression sampling applies to all block-based tables we generate when it is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105 Test Plan: new unit test Reviewed By: riversand963 Differential Revision: D27317275 Pulled By: ajkr fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0	2021-03-25 15:00:45 -07:00
Yanqin Jin	9f7c02dad5	Move compacted_db_impl.[c\|h] to db/db_impl (#8082 ) Summary: As title. All core db implementations should stay in db_impl. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8082 Test Plan: make check Reviewed By: ajkr Differential Revision: D27211442 Pulled By: riversand963 fbshipit-source-id: e0953fde75064740e899aaff7989ff033b7f5232	2021-03-23 13:49:26 -07:00
Yanqin Jin	7ee41a5d25	Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075 ) Summary: As title. Test plan ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test ./backupable_db_test ./error_handler_fs_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075 Reviewed By: zhichao-cao Differential Revision: D27173832 Pulled By: riversand963 fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30	2021-03-18 21:52:48 -07:00
Zhichao Cao	c810947184	Separate handling of WAL Sync io error with SST flush io error (#8049 ) Summary: In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL. Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049 Test Plan: added new unit test, make check Reviewed By: anand1976 Differential Revision: D26965529 Pulled By: zhichao-cao fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410	2021-03-18 14:33:16 -07:00
Peter Dillinger	e7a60d01b2	Revamp WriteController (#8064 ) Summary: WriteController had a number of issues: * It could introduce a delay of 1ms even if the write rate never exceeded the configured delayed_write_rate. * The DB-wide delayed_write_rate could be exceeded in a number of ways with multiple column families: * Wiping all pending delay "debts" when another column family joins the delay with GetDelayToken(). * Resetting last_refill_time_ to (now + sleep amount) means each column family can write with delayed_write_rate for large writes. * Updating bytes_left_ for a partial refill without updating last_refill_time_ would essentially give out random bonuses, especially to medium-sized writes. Now the code is much simpler, with these issues fixed. See comments in the new code and new (replacement) tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064 Test Plan: new tests, better than old tests Reviewed By: mrambacher Differential Revision: D27064936 Pulled By: pdillinger fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f	2021-03-18 09:47:31 -07:00
Akanksha Mahajan	27d57a035e	Use SST file manager to track blob files as well (#8037 ) Summary: Extend support to track blob files in SST File manager. This PR notifies SstFileManager whenever a new blob file is created, via OnAddFile and an obsolete blob file deleted via OnDeleteFile and delete file via ScheduleFileDeletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D26891237 Pulled By: akankshamahajan15 fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6	2021-03-17 20:44:49 -07:00
Yanqin Jin	85d4f2c8b3	Move a test file to a better location (#8054 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054 Test Plan: make check Reviewed By: mrambacher Differential Revision: D27017955 Pulled By: riversand963 fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098	2021-03-15 15:03:27 -07:00
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Yanqin Jin	64517d184a	Make secondary instance use ManifestTailer (#7998 ) Summary: This PR - adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary, - updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class, - removes unused code in version_set.cc, - updates existing tests, e.g. removing deleted sync points from unit tests, - adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998 Test Plan: make check Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc Reviewed By: jay-zhuang Differential Revision: D26926641 Pulled By: riversand963 fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980	2021-03-10 10:59:44 -08:00
fanrui03	67d72fb5dc	Fix checkpoint stuck (#7921 ) Summary: ## 1. Bug description: When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method. ## 2. Simple analysis of the reasons: ### 2.1 Configuration parameters: ```yaml Compaction Style : Universal max_write_buffer_number : 4 min_write_buffer_number_to_merge : 3 ``` Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which are less than `min_write_buffer_number_to_merge`, they will not be flushed. But will enter this code. ```c++ // method: GetWriteStallConditionAndCause if (mutable_cf_options.max_write_buffer_number> 3 && num_unflushed_memtables >= mutable_cf_options.max_write_buffer_number-1) { return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit}; } ``` code link: `fbed72f03c/db/column_family.cc (L847)` Checkpoint thought there was a FlushJob, but it didn't. So will always wait. ### 2.2 solution: Increase the restriction: the `number of Immutable MemTable` >= `min_write_buffer_number_to_merge will wait`. If there are other better solutions, you can correct me. ### 2.3 Code that can reproduce the problem: https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java ## 3. Interesting point This bug will be triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`. Because there is a break in WaitUntilFlushWouldNotStallWrites. ```c++ if (cfd->imm()->NumNotFlushed() < cfd->ioptions()->min_write_buffer_number_to_merge && vstorage->l0_delay_trigger_count() < mutable_cf_options.level0_file_num_compaction_trigger) { break; } ``` code link: `fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)` Universal may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921 Reviewed By: jay-zhuang Differential Revision: D26900559 Pulled By: ajkr fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df	2021-03-09 02:21:25 -08:00
Levi Tamasi	a46f080cce	Break down the amount of data written during flushes/compactions per file type (#8013 ) Summary: The patch breaks down the "bytes written" (as well as the "number of output files") compaction statistics into two, so the values are logged separately for table files and blob files in the info log, and are shown in separate columns (`Write(GB)` for table files, `Wblob(GB)` for blob files) when the compaction statistics are dumped. This will also come in handy for fixing the write amplification statistics, which currently do not consider the amount of data read from blob files during compaction. (This will be fixed by an upcoming patch.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26742156 Pulled By: ltamasi fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5	2021-03-02 09:48:00 -08:00
Yanqin Jin	cef4a6c49f	Compaction filter support for (new) BlobDB (#7974 ) Summary: Allow applications to implement a custom compaction filter and pass it to BlobDB. The compaction filter's custom logic can operate on blobs. To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method. Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in db/blob/db_blob_compaction_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26509280 Pulled By: riversand963 fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39	2021-02-25 16:32:35 -08:00
Akanksha Mahajan	46cf5fbfdd	Extend VerifyFileChecksums API for blob files (#7979 ) Summary: Extend VerifyFileChecksums API to verify blob files in case of use_file_checksum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979 Test Plan: New unit test db_blob_corruption_test Reviewed By: ltamasi Differential Revision: D26534040 Pulled By: akankshamahajan15 fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade	2021-02-22 22:09:22 -08:00
Zhichao Cao	b0fd1cc45a	Introduce a new trace file format (v 0.2) for better extension (#7977 ) Summary: The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977 Test Plan: make check. tested with old trace file in replay and analyzing. Reviewed By: anand1976 Differential Revision: D26529948 Pulled By: zhichao-cao fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76	2021-02-18 23:05:35 -08:00
Jay Zhuang	59ba104e4a	Fix txn `MultiGet()` return un-committed data with snapshot (#7963 ) Summary: TransactionDB uses read callback to filter out un-committed data before a snapshot. But `MultiGet()` API doesn't use that at all, which causes returning unwanted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963 Test Plan: Added unittest to reproduce Reviewed By: anand1976 Differential Revision: D26455851 Pulled By: jay-zhuang fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55	2021-02-18 08:49:00 -08:00
Akanksha Mahajan	6a85aea5b1	Bug fix for status overridden by Status::NotFound in db_impl_readonly (#7972 ) Summary: Bug fix for status returned being overridden by Status::NotFound in DBImpl::OpenForReadOnlyCheckExistence. This was casuing some service owners to misinterpret the actual error and take appropriate steps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972 Reviewed By: riversand963 Differential Revision: D26499598 Pulled By: akankshamahajan15 fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e	2021-02-17 19:35:57 -08:00
Zhichao Cao	d1c510baec	Handoff checksum Implementation (#7523 ) Summary: in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0	2021-02-10 22:20:32 -08:00
Jay Zhuang	cf160b98e1	Add full_history_ts_low option to compaction (#7884 ) Summary: The full_history_ts_low is used for user-defined timestamp GC compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884 Reviewed By: ltamasi Differential Revision: D25982553 Pulled By: jay-zhuang fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee	2021-02-08 13:45:48 -08:00
mrambacher	0a9a05ae12	Make builds reproducible (#7866 ) Summary: Closes https://github.com/facebook/rocksdb/issues/7035 Changed how build_version.cc was generated: - Included the GIT tag/branch in the build_version file - Changed the "Build Date" to be: - If the GIT branch is "clean" (no changes), the date of the last git commit - If the branch is not clean, the current date - Added APIs to access the "build information", rather than accessing the strings directly. The build_version.cc file is now regenerated whenever the library objects are rebuilt. Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866 Reviewed By: pdillinger Differential Revision: D26086565 Pulled By: mrambacher fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f	2021-01-28 17:42:16 -08:00
Levi Tamasi	c696f27432	Accumulate blob file additions in VersionEdit during recovery (#7903 ) Summary: During recovery, RocksDB performs a kind of dummy flush; namely, entries from the WAL are added to memtables, which then get written to SSTs and blob files (if enabled) just like during a regular flush. Note that multiple memtables might be flushed during recovery for the same column family, for example, if the DB is reopened with a lower write buffer size, and therefore, we need to make sure to collect all SST and blob file additions. The patch fixes a bug in the earlier logic which resulted in later blob file additions overwriting earlier ones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903 Test Plan: Added a unit test and ran `db_stress`. Reviewed By: jay-zhuang Differential Revision: D26110847 Pulled By: ltamasi fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd	2021-01-27 18:46:15 -08:00
Zhichao Cao	95013df278	Do not set bg error for compaction in retryable IO Error case (#7899 ) Summary: When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899 Test Plan: tested with error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26094097 Pulled By: zhichao-cao fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f	2021-01-27 17:58:12 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Jay Zhuang	77b4bfe511	Disable PeriodicWorkScheduler during RateLimited test (#7810 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7810 Reviewed By: akankshamahajan15 Differential Revision: D25695454 Pulled By: jay-zhuang fbshipit-source-id: 963d11f38a959de7227ba2be15795af2792413a6	2021-01-11 15:01:52 -08:00
Adam Retter	6e0f62f2b6	Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715 ) Summary: Third batch of adding more tests to ASSERT_STATUS_CHECKED. * db_compaction_filter_test * db_compaction_test * db_dynamic_level_test * db_inplace_update_test * db_sst_test * db_tailing_iter_test * db_io_failure_test Also update GetApproximateSizes APIs to all return Status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715 Reviewed By: jay-zhuang Differential Revision: D25806896 Pulled By: pdillinger fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262	2021-01-06 14:15:02 -08:00
mrambacher	e628f59e87	Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703 ) Summary: This PR does the following: -> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems. -> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls -> Eliminates the public uses of the LegacyFileSystemWrapper. With this change in place, there are effectively the following patterns of Env: - "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system. - Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system - Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself. With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction. Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703 Reviewed By: zhichao-cao Differential Revision: D25762190 Pulled By: anand1976 fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b	2021-01-06 10:49:32 -08:00
mrambacher	0bad2b4308	Ignore the OnAddFile Status for SSTFileManager (#7826 ) Summary: The returned Status is ignored here as some stress tests are failing, presumably when attempting to add an empty file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7826 Reviewed By: jay-zhuang Differential Revision: D25742931 fbshipit-source-id: a1fcd620d9472993a009929306dfc421f93eb43b	2021-01-04 11:08:28 -08:00
Zhichao Cao	44ebc24dca	Add rate_limiter to GenerateOneFileChecksum (#7811 ) Summary: In GenerateOneFileChecksum(), RocksDB reads the file and computes its checksum. A rate limiter can be passed to the constructor of RandomAccessFileReader so that read I/O can be rate limited. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7811 Test Plan: make check Reviewed By: cheng-chang Differential Revision: D25699896 Pulled By: zhichao-cao fbshipit-source-id: e2688bc1126c543979a3bcf91dda784bd7b74164	2020-12-26 22:07:24 -08:00
mrambacher	55e99688cc	No elide constructors (#7798 ) Summary: Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798 Reviewed By: jay-zhuang Differential Revision: D25680451 Pulled By: pdillinger fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917	2020-12-23 16:55:53 -08:00
mrambacher	02418194d7	Add more tests for assert status checked (#7524 ) Summary: Added 10 more tests that pass the ASSERT_STATUS_CHECKED test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524 Reviewed By: akankshamahajan15 Differential Revision: D24323093 Pulled By: ajkr fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801	2020-12-22 23:45:58 -08:00
Adam Retter	81592d9ffa	Add more tests to ASSERT_STATUS_CHECKED (4) (#7718 ) Summary: Fourth batch of adding more tests to ASSERT_STATUS_CHECKED. * db_range_del_test * db_write_test * random_access_file_reader_test * merge_test * external_sst_file_test * write_buffer_manager_test * stringappend_test * deletefile_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7718 Reviewed By: pdillinger Differential Revision: D25671608 fbshipit-source-id: 687a794e98a9e0cd5428ead9898ef05ced987c31	2020-12-22 15:09:39 -08:00
Cheng Chang	fbce7a3808	Track WAL obsoletion when updating empty CF's log number (#7781 ) Summary: In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk. For example, consider there are 3 column families: 0, 1, 2: 1. initially, all the column families' log number is 1; 2. write some data to cf0, and flush cf0, but the flush is pending; 3. now a new WAL 2 is created; 4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number); 5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk; 6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2; 7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3; 8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist. When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption. This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781 Test Plan: watch existing tests and stress tests to pass. `make -j48 blackbox_crash_test` on devserver Reviewed By: ltamasi Differential Revision: D25631695 Pulled By: cheng-chang fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec	2020-12-18 21:34:36 -08:00
Burton Li	2021392e25	Do not full scan obsolete files on compaction busy (#7739 ) Summary: When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure. This caused performance issue when outstanding compactions reached the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739 Reviewed By: cheng-chang Differential Revision: D25508319 Pulled By: ltamasi fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719	2020-12-15 13:51:10 -08:00
Levi Tamasi	1afbd1948c	Add initial blob support to batched MultiGet (#7766 ) Summary: The patch adds initial support for reading blobs to the batched `MultiGet` API. The current implementation simply retrieves the blob values as the blob indexes are encountered; that is, reads from blob files are currently not batched. (This will be optimized in a separate phase.) In addition, the patch removes some dead code related to BlobDB from the batched `MultiGet` implementation, namely the `is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` / `MemTableListVersion`. These were never hooked up to anything and wouldn't work anyways, since a single flag is not sufficient to communicate the "blobness" of multiple key-values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D25479290 Pulled By: ltamasi fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000	2020-12-14 13:48:22 -08:00
Adam Retter	8ff6557e7f	Add further tests to ASSERT_STATUS_CHECKED (2) (#7698 ) Summary: Second batch of adding more tests to ASSERT_STATUS_CHECKED. * external_sst_file_basic_test * checkpoint_test * db_wal_test * db_block_cache_test * db_logical_block_size_cache_test * db_blob_index_test * optimistic_transaction_test * transaction_test * point_lock_manager_test * write_prepared_transaction_test * write_unprepared_transaction_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698 Reviewed By: cheng-chang Differential Revision: D25441664 Pulled By: pdillinger fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00	2020-12-09 21:21:16 -08:00
Cheng Chang	efe827baf0	Always track WAL obsoletion (#7759 ) Summary: Currently, when a WAL becomes obsolete after flushing, if VersionSet::WalSet does not contain the WAL, we do not track the WAL obsoletion event in MANIFEST. But consider this case: * WAL 10 is synced, a VersionEdit is LogAndApplied to MANIFEST to log this WAL addition event, but the VersionEdit is not applied to WalSet yet since its corresponding ManifestWriter is still pending in the write queue; * Since the above ManifestWriter is blocking, the LogAndApply will block on a conditional variable and release the db mutex, so another LogAndApply can proceed to enqueue other VersionEdits concurrently; * Now flush happens, and WAL 10 becomes obsolete, although WalSet does not contain WAL 10 yet, we should call LogAndApply to enqueue a VersionEdit to indicate the obsoletion of WAL 10; * otherwise, when the queued edit indicating WAL 10 addition is logged to MANIFEST, and DB crashes and reopens, the WAL 10 might have been removed from disk, but it still exists in MANIFEST. This PR changes the behavior to: always `LogAndApply` any WAL addition or obsoletion event, without considering the order issues caused by concurrency, but when applying the edits to `WalSet`, do not add the WALs if they are already obsolete. In this approach, the logical events of WAL addition and obsoletion are always tracked in MANIFEST, so we can inspect the MANIFEST and know all the previous WAL events, but we choose to ignore certain events due to the concurrency issues such as the case above, or the case in https://github.com/facebook/rocksdb/pull/7725. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7759 Test Plan: make check Reviewed By: pdillinger Differential Revision: D25423089 Pulled By: cheng-chang fbshipit-source-id: 9cb9a7fbc1875bf954f2a42f9b6cfd6d49a7b21c	2020-12-09 16:02:12 -08:00
Cheng Chang	07030c6f4a	Do not track obsolete WALs in MANIFEST even if they are synced (#7725 ) Summary: Consider the case: 1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3; 2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2; 3. DB crashes; 4. During DB recovery, when replaying MANIFEST, we first see that WAL with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2; 5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`. The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`. The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725 Test Plan: 1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added. 2. run `make crash_test` on devserver. Reviewed By: riversand963 Differential Revision: D25238914 Pulled By: cheng-chang fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72	2020-12-08 10:58:04 -08:00
mrambacher	db03172d08	Change ErrorHandler methods to return const Status& (#7539 ) Summary: This change eliminates the need for a lot of the PermitUncheckedError calls on return from ErrorHandler methods. The calls are no longer needed as the status is returned as a reference rather than a copy. Additionally, this means that the originating status (recovery_error_, bg_error_) is not cleared implicitly as a result of calling one of these methods. For this class, I do not know if the proper behavior should be to call PermitUncheckedError in the destructor or if the checked state should be cleared when the status is cleared. I did tests both ways. Without the code in the destructor, the status will need to be cleared in at least some of the places where it is set to OK. When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539 Reviewed By: anand1976 Differential Revision: D25340565 Pulled By: pdillinger fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c	2020-12-07 20:11:35 -08:00
Yanqin Jin	eee0af9af1	Add full_history_ts_low to column family (#7740 ) Summary: Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`. `ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`. `ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only increase its value. Consider the following case: > > The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered > and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards, > and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result > is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now. > During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740 Test Plan: make check Reviewed By: ltamasi Differential Revision: D25296217 Pulled By: riversand963 fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a	2020-12-05 14:18:22 -08:00
Levi Tamasi	61932cdf1d	Add blob support to DBIter (#7731 ) Summary: The patch adds iterator support to the integrated BlobDB implementation. Whenever a blob reference is encountered during iteration, the corresponding blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index` (formerly `allow_blob`) flag is not set. (Note: the flag is set by the old stacked BlobDB implementation, which has its own blob file handling/blob retrieval logic.) In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error message `"BlobDB does not support merge operator."` when encountering a blob reference while performing a merge (instead of potentially returning a message that implies the database should be opened using the stacked BlobDB's `Open`.) TODO: We can implement support for lazily retrieving the blob value (or in other words, bypassing the retrieval of blob values based on key) by extending the `Iterator` API with a new `PrepareValue` method (similarly to `InternalIterator`, which already supports lazy values). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D25256293 Pulled By: ltamasi fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811	2020-12-04 21:29:38 -08:00
Zhichao Cao	e102de7318	Fix assert(cfd->imm()->NumNotFlushed() > 0) in FlushMemtable (#7744 ) Summary: In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() \|\| !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0. Remove the assert and added the comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744 Test Plan: make check and pass the stress test Reviewed By: riversand963 Differential Revision: D25340573 Pulled By: zhichao-cao fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672	2020-12-04 20:31:39 -08:00
Cheng Chang	70f2e0916a	Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570 ) Summary: When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570 Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds. Reviewed By: riversand963 Differential Revision: D24394222 Pulled By: cheng-chang fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116	2020-12-03 19:22:24 -08:00

1 2 3 4 5 ...

379 Commits