rocksdb

Author	SHA1	Message	Date
Jay Zhuang	97bf78721b	Update HISTORY.md and version.h for 6.13.3	2020-10-14 09:29:38 -07:00
Jay Zhuang	0bda0e3dfa	Fix StallWrite crash with mixed of slowdown/no_slowdown writes (#7508 ) Summary: `BeginWriteStall()` removes no_slowdown write from the write list and updates `link_newer`, which makes `CreateMissingNewerLinks()` thought all write list has valid `link_newer` and failed to create link for all writers. It caused flaky test and SegFault for release build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508 Test Plan: Add unittest to reproduce the issue. Reviewed By: anand1976 Differential Revision: D24126601 Pulled By: jay-zhuang fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06	2020-10-13 17:44:02 -07:00
Andrew Kryczka	950b72c743	Update HISTORY.md and version.h for 6.13.2	2020-10-13 09:41:35 -07:00
Andrew Kryczka	6eccfbbf46	Fix bug when `DeleteRange()` used with `paranoid_file_checks == true` Backports part of `7508175558`. The unit test comes from https://github.com/facebook/rocksdb/pull/7521.	2020-10-13 09:40:12 -07:00
Andrew Kryczka	ed5c64e0d9	update HISTORY.md and version.h for 6.13.1	2020-10-12 14:06:52 -07:00
Andrew Kryczka	6676d31d3f	Fix bug in pinned partitioned indexes with some reads bypassing block cache Backports part of `75d3b6fdf0`.	2020-10-12 13:55:59 -07:00
Andrew Kryczka	82c5a09241	Fix bug in pinned partitioned user key indexes Backports part of `75d3b6fdf0`.	2020-10-12 13:54:35 -07:00
Andrew Kryczka	ac39f413d8	Add missing release note	2020-10-12 13:54:11 -07:00
Peter Dillinger	e29a510f45	Add Unreleased to HISTORY.md for patches	2020-09-24 08:21:36 -07:00
Peter Dillinger	cda8a74eb7	Less I/O for incremental backups, slightly better corruption detection (#7413 ) Summary: Two relatively simple functional changes to incremental backup behavior, integrated with a minor refactoring to reduce code redundancy and improve error/log message. There are nuances to the impact of these changes, but I believe they are fundamentally good and generally safe. Those functional changes: * Incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file checksums are needed to determine file naming. * Justification: incremental backups should not need to read the whole DB, especially without rate limiting. (Although other BackupEngine reads are not rate limited either, other non-trivial reads are generally limited by a corresponding write, as in copying files.) Also, the fact that this is not already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110. * When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch. * Justification: a random related fix that also helps to cover a small hole in corruption checking uncovered by the other functional change: * For `share_table_files` without "checksum" (not recommended), the other change regresses in detecting fundamentally unsafe use of this option combination: when you might generate different versions of same SST file number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this regression is greatly mitigated by the new file size checking. Nevertheless, almost no reason to use `share_files_with_checksum=false` should remain, and comments are updated appropriately. Also, this change renames internal function `CalculateChecksum` to `ReadFileAndComputeChecksum` to make the performance impact of this function clear in code reviews. It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it cannot be true for a DB with unique file names (like DBImpl). Nevertheless, I've tried to keep its functionality intact when `true` to minimize risk for now, despite having no unit tests for which it is true. Select impact details (much more in unit tests): For `share_files_with_checksum`, I am confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time. (With computed checksums in names, a recently corrupted file just looked like a different file vs. what was already backed up.) Even in the hypothetical case of DB session id collision (~100 bits entropy collision), file size in name and/or our file size check add an extra layer of protection against false success in creating an accurate new backup. (Unit test included.) `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not. Note that when custom file checksum support is added to BackupEngine, that will essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`. We could add options for `CreateNewBackup` to cover some of what would be caught by `VerifyBackup` with checksum checking. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413 Test Plan: Two new unit tests included, both of which fail without these changes. Although we don't test the I/O improvement directly, we test it indirectly in DB corruption detection power that was inadvertently unlocked with new backup file naming PLUS computing current content checksums (now removed). (I don't think that case of DB corruption detection justifies reading the whole DB on incremental backup.) Reviewed By: zhichao-cao Differential Revision: D23818480 Pulled By: pdillinger fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac	2020-09-22 10:20:22 -07:00
Peter Dillinger	fb98398ca9	Postponing custom checksum support in BackupEngine (#7411 ) Summary: This change reverts BackupEngine to 6.12 state to accommodate a higher-priority fix that does not easily merge with this custom checksum support. We intend to reinstate this support soon, by merging a revert of this change. For backupable_db_test, I've removed the tests depending on this feature. I've also removed relevant HISTORY.md entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411 Test Plan: unit tests Reviewed By: ajkr Differential Revision: D23793835 Pulled By: pdillinger fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052	2020-09-22 10:18:30 -07:00
Peter Dillinger	2dbb90a064	Update HISTORY.md for #7346 (#7417 ) Summary: Copied from Andrew's entry for 6.12.3. Inserted here retroactive to 6.12 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7417 Test Plan: no code change Reviewed By: jay-zhuang Differential Revision: D23815980 Pulled By: pdillinger fbshipit-source-id: 3c8a052cdb61be1215d311556c9487f9ea5c8cb0	2020-09-21 09:49:45 -07:00
Peter Dillinger	b4f29f7515	Restore file size in backup table file names (and other cleanup) (#7400 ) Summary: Prior to 6.12, backup files using share_files_with_checksum had the file size encoded in the file name, after the last '\_' and before the last '.'. We considered this an implementation detail subject to change, and indeed removed this information from the file name (with an option to use old behavior) because it was considered ineffective/inefficient for file name uniqueness. However, some downstream RocksDB users were relying on this information since the file size is not explicitly in the backup manifest file. This primary purpose of this change is "retrofitting" the 6.12 release (not yet a public release) to simultaneously support the benefits of the new naming scheme (I/O performance and data correctness at scale) and preserve the file size information, both as default behaviors. With this change, we are essentially making the file size information encoded in the file name an official, though obscure, extension of the backup meta file format. We preserve an option (kLegacyCrc32cAndFileSize) to use the original "legacy" naming scheme, with its caveats, and make it easy to omit the file size information (no kFlagIncludeFileSize), for more compact file names. But note that changing the naming scheme used on an existing db and backup directory can lead to transient space amplification, as some files will be stored under two names in the shared_checksum directory. Because some backups were saved using the original 6.12 naming scheme, we offer two ways of dealing with those files: SST files generated by older 6.12 versions can either use the default naming scheme in effect when the SST files were generated (kFlagMatchInterimNaming, default, no transient space amplification) or can use a new naming scheme (no kFlagMatchInterimNaming, potential space amplification because some already stored files getting a new name). We don't have a natural way to detect which files were generated by previous 6.12 versions, but this change hacks one in by changing DB session ids to now use a more concise encoding, reducing file name length, saving ~dozen bytes from SST files, and making them visually distinct from DB ids so that they are less likely to be mixed up. Two final auxiliary notes: Recognizing that the backup file names have become a de facto part of the backup meta schema, this change makes them easier to parse and extend by putting a distinct marker, 's', before DB session ids embedded in the name. When we extend this to allow custom checksums in the name, they can get their own marker to ensure safe parsing. For backward compatibility, file size does not get a marker but is assumed for `_[0-9]+[.]` Another change from initial 6.12 default behavior is never including file custom checksum in the file name. Looking ahead to 6.13, we do not want the default behavior to cause backup space amplification for someone turning on file custom checksum checking in BackupEngine; we want that to be an easy decision. When implemented, including file custom checksums in backup file names will be a non-default option. Actual file name patterns and priorities, as regexes: kLegacyCrc32cAndFileSize OR pre-6.12 SST file -> [0-9]+_[0-9]+_[0-9]+[.]sst kFlagMatchInterimNaming set (default) AND early 6.12 SST file -> [0-9]+_[0-9a-fA-F-]+[.]sst kUseDbSessionId AND NOT kFlagIncludeFileSize -> [0-9]+_s[0-9A-Z]{20}[.]sst kUseDbSessionId AND kFlagIncludeFileSize (default) -> [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst We might add opt-in options for more '\_' separated data in the name, but embedded file size, if present, will always be after last '\_' and before '.sst'. This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400 Test Plan: unit tests included. Sync point callbacks are used to mimic previous version SST files. Reviewed By: ajkr Differential Revision: D23759587 Pulled By: pdillinger fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025	2020-09-21 08:05:23 -07:00
Zhichao Cao	a02495cf80	Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310 ) Summary: In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310 Test Plan: adding new unit test, pass make check. Reviewed By: anand1976 Differential Revision: D23710892 Pulled By: zhichao-cao fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9	2020-09-21 07:59:19 -07:00
anand76	5a0cef9259	Update HISTORY.md with IO fencing error code (#7402 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7402 Reviewed By: pdillinger Differential Revision: D23761689 Pulled By: anand1976 fbshipit-source-id: 59e10f0aaa80f6c0f5a46dc99467138c4cee0511	2020-09-17 11:52:23 -07:00
Peter Dillinger	ecc8ffe17b	Update master to version 6.13 (#7378 ) Summary: for release fork Pull Request resolved: https://github.com/facebook/rocksdb/pull/7378 Test Plan: make check + CI Reviewed By: jay-zhuang Differential Revision: D23669163 Pulled By: pdillinger fbshipit-source-id: 14cbf95b32717c28418c71cc8e10f06733bbc49f	2020-09-12 13:18:09 -07:00
Yanqin Jin	205e577694	Cancel tombstone skipping during bottommost compaction (#7356 ) Summary: During bottommost compaction, RocksDB cannot simply drop a tombstone if this tombstone is not in the earliest snapshot. The current behavior is: RocksDB skips other internal keys (of the same user key) in the same snapshot range. In the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it is possible for a bottommost compaction that has already started running to take a long time to finish, even if the application has tried to cancel all background jobs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356 Test Plan: make check Reviewed By: ltamasi Differential Revision: D23663241 Pulled By: riversand963 fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5	2020-09-11 17:45:43 -07:00
Peter Dillinger	92639b93a6	Fix checkpoint file deletion race with avoid_unnecessary_blocking_io (#7369 ) Summary: https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after `DisableFileDeletions`, all pending purges of previously obsolete WAL files will have finished. However, the addition of avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making that assurance, which can lead to files to be copied for checkpoint or backup going missing before being copied, with that option enabled. This change patches the hole. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369 Test Plan: apparent fix to backups in crash test observed. Will work on a unit test for another commit Reviewed By: ajkr Differential Revision: D23620258 Pulled By: pdillinger fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614	2020-09-10 22:35:25 -07:00
Yanqin Jin	8307d4400c	Update HISTORY.md for PR7329 (#7355 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7355 Reviewed By: pdillinger Differential Revision: D23566635 Pulled By: riversand963 fbshipit-source-id: f8d846bcff637e7617b764b7bfb9a948ea18d195	2020-09-08 11:10:25 -07:00
Andrew Kryczka	5746767387	add `ldb unsafe_remove_sst_file` subcommand (#7335 ) Summary: This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API. First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB: - Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state. - Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object. Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335 Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc. Reviewed By: pdillinger Differential Revision: D23454575 Pulled By: ajkr fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe	2020-09-03 16:54:51 -07:00
Andrew Kryczka	40e97b02be	add warning on `DeleteFile()` API (#7337 ) Summary: Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337 Reviewed By: pdillinger Differential Revision: D23459728 Pulled By: ajkr fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb	2020-09-03 16:42:01 -07:00
Andrew Kryczka	af54c4092a	fix SstFileWriter with dictionary compression (#7323 ) Summary: In block-based table builder, the cut-over from buffered to unbuffered mode involves sampling the buffered blocks and generating a dictionary. There was a bug where `SstFileWriter` passed zero as the `target_file_size` causing the cutover to happen immediately, so there were no samples available for generating the dictionary. This PR changes the meaning of `target_file_size == 0` to mean buffer the whole file before cutting over. It also adds dictionary compression support to `sst_dump --command=recompress` for easy evaluation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323 Reviewed By: cheng-chang Differential Revision: D23412158 Pulled By: ajkr fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873	2020-09-03 15:49:57 -07:00
Hiep	d0c1a01c1b	Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166 ) Summary: - Closes https://github.com/facebook/rocksdb/issues/6490 - Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166 Test Plan: - Automated: `make all check` - Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges. Reviewed By: cheng-chang Differential Revision: D23325425 Pulled By: ajkr fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68	2020-09-03 14:39:58 -07:00
Andrew Kryczka	177f8bd063	Bound L0->Lbase fanout in dynamic leveled compaction (#7325 ) Summary: L0 score is based on size target and number of files. The size target used is `max_bytes_for_level_base`. However, the base level's size can dynamically expand in write burst mode. In fact, it can expand so much that L0->Lbase becomes the highest fanout in target sizes. This doesn't make sense from an efficiency perspective, so this PR bounds the L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based on file count remains unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325 Test Plan: contrived benchmark that exhibits the problem: ``` $ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000 ``` Results: - "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark - "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction Branch \| Burst W-Amp \| Total W-Amp \| fillrandom (MB/s) -- \| -- \| -- \| -- master \| 20.2 \| 21.5 \| 4.7 dynamic-l0-score \| 12.6 \| 14.1 \| 7.2 Reviewed By: siying Differential Revision: D23412935 Pulled By: ajkr fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e	2020-09-01 19:34:01 -07:00
Akanksha Mahajan	963314ffd6	Add unit test for max_write_buffer_size_to_maintain (#7311 ) Summary: Add a unit test case to check memory usage when max_write_buffer_size_to_maintain is set if flushed immutable memtables are trimmed timely or not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311 Test Plan: Compared the results with before bug fix. Reviewed By: ltamasi Differential Revision: D23321702 Pulled By: akankshamahajan15 fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e	2020-08-28 17:38:05 -07:00
Jay Zhuang	c2485f2d81	Add buffer prefetch support for non directIO usecase (#7312 ) Summary: A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312 Reviewed By: anand1976 Differential Revision: D23329847 Pulled By: jay-zhuang fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527	2020-08-27 18:16:53 -07:00
Peter Dillinger	9aad24da55	Real fix for race in backup custom checksum checking (#7309 ) Summary: This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294. To get DB checksum info for live files, we now read the manifest file that will become part of the checkpoint/backup. This requires a little extra handling in taking a custom checkpoint, including only reading the manifest file up to the size prescribed by the checkpoint. This moves GetFileChecksumsFromManifest from backup code to file_checksum_helper.{h,cc} and removes apparently unnecessary checking related to column families. Updated HISTORY.md and warned potential future users of DB::GetLiveFilesChecksumInfo() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309 Test Plan: updated unit test, before and after Reviewed By: ajkr Differential Revision: D23311994 Pulled By: pdillinger fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5	2020-08-26 10:39:20 -07:00
sdong	722814e357	Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297 ) Summary: Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297 Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it. Reviewed By: anand1976 Differential Revision: D23257950 fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03	2020-08-25 19:01:05 -07:00
Zhichao Cao	d51f88c9e4	Pass SST file checksum information through OnTableFileCreated (#7108 ) Summary: When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108 Test Plan: make check, listener_test. Reviewed By: ajkr Differential Revision: D22470240 Pulled By: zhichao-cao fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae	2020-08-25 10:46:11 -07:00
Connor1996	416943bf28	Eliminates a no-op compaction upon snapshot release when disabling auto compactions (#7267 ) Summary: After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions. When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot. Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267 Test Plan: - make check - manual test Reviewed By: akankshamahajan15 Differential Revision: D23252880 Pulled By: ajkr fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b	2020-08-24 22:06:45 -07:00
Akanksha Mahajan	3844612625	Bug Fix for memtables not trimmed down. (#7296 ) Summary: When a memtable is trimmed in MemTableListVersion, the memtable is only added to delete list if it is the last reference. However it is not the last reference as it is held by the super version. But the super version would not be switched if the delete list is empty. So the memtable is never destroyed and memory usage increases beyond write_buffer_size + max_write_buffer_size_to_maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296 Test Plan: 1. ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot Reviewed By: ltamasi Differential Revision: D23267395 Pulled By: akankshamahajan15 fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650	2020-08-21 13:29:05 -07:00
Peter Dillinger	a1b5484811	Work around a backup bug with DB custom checksums (#7294 ) Summary: On a read-write DB configured with DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can fail intermittently, with non-OK status. This is due to a race between GetLiveFiles and GetLiveFilesChecksumInfo in creating backups. For patching 6.12 release (as this commit is intended for, except this is a forward-merged version), we can simply treat files for which we falsely failed to get checksum info as legacy files lacking checksum info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294 Test Plan: unit test reproducer included Reviewed By: ajkr Differential Revision: D23253489 Pulled By: pdillinger fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14	2020-08-21 08:16:04 -07:00
Andrew Kryczka	5d5ff82408	Disable `recycle_log_file_num` with `kTolerateCorruptedTailRecords` (#7271 ) Summary: The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271 Reviewed By: riversand963 Differential Revision: D23169923 Pulled By: ajkr fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50	2020-08-17 18:21:10 -07:00
Yanqin Jin	92593d511a	Add a new EntryType for deletion with timestamp (#7195 ) Summary: Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195 Test Plan: make check Reviewed By: ajkr Differential Revision: D22914704 Pulled By: riversand963 fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39	2020-08-17 16:26:06 -07:00
sdong	1760637539	CompactRange() refit level should confirm destination level is not empty (#7261 ) Summary: There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261 Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds. Reviewed By: ajkr Differential Revision: D23142269 fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91	2020-08-17 14:21:53 -07:00
Jay Zhuang	69760b4d05	Introduce a global StatsDumpScheduler for stats dumping (#7223 ) Summary: Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223 Reviewed By: riversand963 Differential Revision: D23056737 Pulled By: jay-zhuang fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba	2020-08-14 20:12:44 -07:00
Andrew Kryczka	a1aa3f8385	Disable manual compaction during `ReFitLevel()` (#7250 ) Summary: Manual compaction with `CompactRangeOptions::change_levels` set could refit to a level targeted by another manual compaction. If force_consistency_checks were disabled, it could be possible for overlapping files to be written at that target level. This PR prevents the possibility by calling `DisableManualCompaction()` prior to `ReFitLevel()`. It also improves the manual compaction disabling mechanism to wait for pending manual compactions to complete before returning, and support disabling from multiple threads. Fixes https://github.com/facebook/rocksdb/issues/6432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250 Test Plan: crash test command that repro'd the bug reliably: ``` $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0 ``` Reviewed By: ltamasi Differential Revision: D23090800 Pulled By: ajkr fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13	2020-08-14 11:29:52 -07:00
Zitan Chen	b578ca2e4d	BackupEngine supports custom file checksums (#7085 ) Summary: A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups. Tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22390756 Pulled By: gg814 fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f	2020-08-12 13:31:09 -07:00
sdong	41c328fe57	Fix a perf regression that caused every key to go through upper bound check (#7209 ) Summary: https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound. Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209 Test Plan: Run all existing tests. Will run atomic black box crash test for a while. Reviewed By: anand1976 Differential Revision: D22859246 fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140	2020-08-04 11:30:09 -07:00
Yanqin Jin	a38f04ac26	Update HISTORY and version for 6.12 release (#7194 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7194 Reviewed By: gg814 Differential Revision: D22810654 Pulled By: riversand963 fbshipit-source-id: 01f13089fa2b7e31b827da3e30c90e5c62c41380	2020-07-29 10:13:21 -07:00
Tomas Kolda	cd4592c220	SST Partitioner interface that allows to split SST files (#6957 ) Summary: SST Partitioner interface that allows to split SST files during compactions. It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957 Reviewed By: ajkr Differential Revision: D22461239 fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e	2020-07-24 13:44:49 -07:00
Cheng Chang	7af1fab443	Update HISTORY (#7158 ) Summary: Mention the MultiRead bug in HISTORY. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7158 Test Plan: N/A Reviewed By: siying Differential Revision: D22670565 Pulled By: cheng-chang fbshipit-source-id: 16abf0192957be66511f6a08e00157bfd37b189f	2020-07-23 08:47:13 -07:00
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	2020-07-22 18:32:52 -07:00
mrambacher	d44cbc5314	Add hash of key/value checks when paranoid_file_checks=true (#7134 ) Summary: When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected. Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134 Reviewed By: cheng-chang Differential Revision: D22646149 fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f	2020-07-22 11:04:40 -07:00
Haosen Wen	dbc51adbac	Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153 ) Summary: Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem. `steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153 Test Plan: make check. Reviewed By: riversand963 Differential Revision: D22654136 Pulled By: roghnin fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85	2020-07-22 08:55:02 -07:00
Zitan Chen	b923dc720b	BackupEngine computes table checksums only once if db session ids are available (#7110 ) Summary: BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory. The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties. Filenames of table files in the `shared_checksum` directory depend on the following conditions: 1. the naming scheme is `kOptionalChecksumAndDbSessionId`, 2. `db_session_id` is not empty, 3. checksum is available in the DB manifest. If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`. If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`. In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`. Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110 Test Plan: backupable_db_test and manual tests. Reviewed By: ajkr Differential Revision: D22508992 Pulled By: gg814 fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389	2020-07-21 10:35:40 -07:00
Andrew Kryczka	9a83fd21e6	stagger first DumpMallocStats after opening DB (#7145 ) Summary: Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145 Reviewed By: riversand963 Differential Revision: D22593031 Pulled By: ajkr fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6	2020-07-17 16:13:26 -07:00
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	2020-07-15 11:03:58 -07:00
Yanqin Jin	27735dea9a	Report corrupted keys during compaction (#7124 ) Summary: Currently, RocksDB lets compaction to go through even in case of corrupted keys, the number of which is reported in CompactionJobStats. However, RocksDB does not check this value. We should let compaction run in a stricter mode. Temporarily disable two tests that allow corrupted keys in compaction. With this PR, the two tests will assert(false) and terminate. Still need to investigate what is the recommended google-test way of doing it. Death test (EXPECT_DEATH) in gtest has warnings now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124 Test Plan: make check Reviewed By: ajkr Differential Revision: D22530722 Pulled By: riversand963 fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6	2020-07-14 17:18:17 -07:00
Andrew Kryczka	82611ee25a	save key comparisons in BlockIter::BinarySeek (#7068 ) Summary: This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068 Test Plan: ran readrandom with mostly default settings and counted key comparisons using `PerfContext`. before: `user_key_comparison_count = 28881965` after: `user_key_comparison_count = 27823245` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: anand1976 Differential Revision: D22357032 Pulled By: ajkr fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24	2020-07-09 12:27:20 -07:00

1 2 3 4 5 ...

821 Commits