rocksdb

Author	SHA1	Message	Date
Levi Tamasi	e5311a8ea4	Fix a SingleDelete related optimization for blob indexes (#7904 ) Summary: There is a small `SingleDelete` related optimization in the `CompactionIterator` code: when a `SingleDelete`-`Put` pair is preserved solely for the purposes of transaction conflict checking, the value itself gets cleared. (This is referred to as "optimization 3" in the `CompactionIterator` code.) Though the rest of the code got updated to support `SingleDelete`'ing blob indexes, this chunk was apparently missed, resulting in an assertion failure (or `ROCKS_LOG_FATAL` in release builds) when triggered. Note: in addition to clearing the value, we also need to update the type of the KV to regular value when dealing with blob indexes here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7904 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D26118009 Pulled By: ltamasi fbshipit-source-id: 6bf78043d20265e2b15c2e1ab8865025040c42ae	2021-01-29 12:41:25 -08:00
Andrew Kryczka	78ee8564ad	Integrity protection for live updates to WriteBatch (#7748 ) Summary: This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.). The foundation classes (`ProtectionInfo.`) embed the coverage info in their type, and provide `Protect.()` and `Strip.()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer. When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748 Test Plan: - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption - [deferred] unit tests for `ProtectionInfo.` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc. Reviewed By: pdillinger Differential Revision: D25754492 Pulled By: ajkr fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866	2021-01-29 12:18:58 -08:00
mrambacher	4a09d632c4	Remove Legacy and Custom FileWrapper classes from header files (#7851 ) Summary: Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem. Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851 Reviewed By: anand1976 Differential Revision: D26114816 Pulled By: mrambacher fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150	2021-01-28 22:10:32 -08:00
mrambacher	0a9a05ae12	Make builds reproducible (#7866 ) Summary: Closes https://github.com/facebook/rocksdb/issues/7035 Changed how build_version.cc was generated: - Included the GIT tag/branch in the build_version file - Changed the "Build Date" to be: - If the GIT branch is "clean" (no changes), the date of the last git commit - If the branch is not clean, the current date - Added APIs to access the "build information", rather than accessing the strings directly. The build_version.cc file is now regenerated whenever the library objects are rebuilt. Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866 Reviewed By: pdillinger Differential Revision: D26086565 Pulled By: mrambacher fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f	2021-01-28 17:42:16 -08:00
Levi Tamasi	c696f27432	Accumulate blob file additions in VersionEdit during recovery (#7903 ) Summary: During recovery, RocksDB performs a kind of dummy flush; namely, entries from the WAL are added to memtables, which then get written to SSTs and blob files (if enabled) just like during a regular flush. Note that multiple memtables might be flushed during recovery for the same column family, for example, if the DB is reopened with a lower write buffer size, and therefore, we need to make sure to collect all SST and blob file additions. The patch fixes a bug in the earlier logic which resulted in later blob file additions overwriting earlier ones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903 Test Plan: Added a unit test and ran `db_stress`. Reviewed By: jay-zhuang Differential Revision: D26110847 Pulled By: ltamasi fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd	2021-01-27 18:46:15 -08:00
Zhichao Cao	95013df278	Do not set bg error for compaction in retryable IO Error case (#7899 ) Summary: When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899 Test Plan: tested with error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26094097 Pulled By: zhichao-cao fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f	2021-01-27 17:58:12 -08:00
Jay Zhuang	c6ff4c0b70	Fix deadlock in `fs_test.WALWriteRetryableErrorAutoRecover1` (#7897 ) Summary: The recovery thread could hold the db.mutex, which is needed from sync write in main thread. Make sure the write is done before recovery thread starts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7897 Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200` Reviewed By: zhichao-cao Differential Revision: D26082933 Pulled By: jay-zhuang fbshipit-source-id: 226fc49228c0e5903f86ff45cc3fed3080abdb1f	2021-01-26 17:02:03 -08:00
Jay Zhuang	9425acacce	Fix flaky `error_handler_fs_test.MultiDBCompactionError` (#7896 ) Summary: The error recovery thread may out-live DBImpl object, which causing access released DBImpl.mutex. Close SstFileManager before closing DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7896 Test Plan: the issue can be reproduced by adding sleep in recovery code. Pass the tests with sleep. Reviewed By: zhichao-cao Differential Revision: D26076655 Pulled By: jay-zhuang fbshipit-source-id: 0d9cc5639c12fcfc001427015e75a9736f33cd96	2021-01-26 11:00:12 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Akanksha Mahajan	1d226018af	In IOTracing, add filename with each operation in trace file. (#7885 ) Summary: 1. In IOTracing, add filename with each IOTrace record. Filename is stored in file object (Tracing Wrappers). 2. Change the logic of figuring out which additional information (file_size, length, offset etc) needs to be store with each operation which is different for different operations. When new information will be added in future (depends on operation), this change would make the future additions simple. Logic: In IOTraceRecord, io_op_data is added and its bitwise positions represent which additional information need to added in the record from enum IOTraceOp. Values in IOTraceOp represent bitwise positions. So if length and offset needs to be stored (IOTraceOp::kIOLen is 1 and IOTraceOp::kIOOffset is 2), position 1 and 2 (from rightmost bit) will be set and io_op_data will contain 110. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7885 Test Plan: Updated io_tracer_test and verified the trace file manually. Reviewed By: anand1976 Differential Revision: D25982353 Pulled By: akankshamahajan15 fbshipit-source-id: ebfc5539cc0e231d7794a6b42b73f5403e360b22	2021-01-25 14:37:35 -08:00
Levi Tamasi	431e8afba7	Do not explicitly flush blob files when using the integrated BlobDB (#7892 ) Summary: In the original stacked BlobDB implementation, which writes blobs to blob files immediately and treats blob files as logs, it makes sense to flush the file after writing each blob to protect against process crashes; however, in the integrated implementation, which builds blob files in the background jobs, this unnecessarily reduces performance. This patch fixes this by simply adding a `do_flush` flag to `BlobLogWriter`, which is set to `true` by the stacked implementation and to `false` by the new code. Note: the change itself is trivial but the tests needed some work; since in the new implementation, blobs are now buffered, adding a blob to `BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we manipulate the return values of I/O methods directly using `SyncPoint`s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D26022814 Pulled By: ltamasi fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd	2021-01-25 13:32:33 -08:00
Matthew Von-Maszewski	12a8be1d44	MergeHelper::FilterMerge() calling ElapsedNanosSafe() upon exit even … (#7867 ) Summary: …when unused. Causes many calls to clock_gettime, impacting performance. Was looking for something else via Linux "perf" command when I spotted heavy usage of clock_gettime during a compaction. Our product heavily uses the rocksdb::Options::merge_operator. MergeHelper::FilterMerge() properly tests if timing is enabled/disabled upon entry, but not on exit. This patch fixes the exit. Note: the entry test also verifies if "nullptr!=stats_". This test is redundant to code within ShouldReportDetailedTime(). Therefore I omitted it in my change. merge_test.cc updated with test that shows failure before merge_helper.cc change ... and fix after change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7867 Reviewed By: jay-zhuang Differential Revision: D25960175 Pulled By: ajkr fbshipit-source-id: 56e66d7eb6ae5eae89c8e0d5a262bd2905a226b6	2021-01-21 13:13:02 -08:00
Andrew Kryczka	e18a4df62a	workaround race conditions during `PeriodicWorkScheduler` registration (#7888 ) Summary: This provides a workaround for two race conditions that will be fixed in a more sophisticated way later. This PR: (1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe. (2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888 Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881 Reviewed By: jay-zhuang Differential Revision: D25990891 Pulled By: ajkr fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38	2021-01-21 08:50:38 -08:00
Levi Tamasi	2d37830e44	Make blob related VersionEdit tags unignorable (#7886 ) Summary: BlobFileAddition and BlobFileGarbage should not be in the ignorable tag range, since if they are present in the MANIFEST, users cannot downgrade to a RocksDB version that does not understand them without losing access to the data in the blob files. The patch moves these two tags to the unignorable range; this should still be safe at this point, since the integrated BlobDB project is still work in progress and thus there shouldn't be any ignorable BlobFileAddition/BlobFileGarbage tags out there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7886 Test Plan: `make check` Reviewed By: cheng-chang Differential Revision: D25980956 Pulled By: ltamasi fbshipit-source-id: 13cf5bd61d77f049b513ecd5ad0be8c637e40a9d	2021-01-20 20:29:04 -08:00
Cheng Chang	e44948295e	Make it able to ignore WAL related VersionEdits in older versions (#7873 ) Summary: Although the tags for `WalAddition`, `WalDeletion` are after `kTagSafeIgnoreMask`, to actually be able to skip these entries in older versions of RocksDB, we require that they are encoded with their encoded size as the prefix. This requirement is not met in the current codebase, so a downgraded DB may fail to open if these entries exist in the MANIFEST. If a DB wants to downgrade, and its MANIFEST contains `WalAddition` or `WalDeletion`, it can set `track_and_verify_wals_in_manifest` to `false`, then restart twice, then downgrade. On the first restart, a new MANIFEST will be created with a `WalDeletion` indicating that all previously tracked WALs are removed from MANIFEST. On the second restart, since there is no tracked WALs in MANIFEST now, a new MANIFEST will be created with neither `WalAddition` nor `WalDeletion`. Then the DB can downgrade. Tags for `BlobFileAddition`, `BlobFileGarbage` also have the same problem, but this PR focuses on solving the problem for WAL edits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7873 Test Plan: Added a `VersionEditTest::IgnorableTags` unit test to verify all entries with tags larger than `kTagSafeIgnoreMask` can actually be skipped and won't affect parsing of other entries. Reviewed By: ajkr Differential Revision: D25935930 Pulled By: cheng-chang fbshipit-source-id: 7a02fdba4311d6084328c14aed110a26d08c3efb	2021-01-19 19:27:53 -08:00
Vladimir Maksimovski	4db58bcfb2	Fix write-ahead log file size overflow (#7870 ) Summary: The WAL's file size is stored as an unsigned 64 bit integer. In db_info_dumper.cc, this integer gets converted to a string. Since 2^64 is approximately 10^19, we need 20 digits to represent the integer correctly. To store the decimal representation, we need 21 bytes (+1 due to the '\0' terminator at the end). The code previously used 16 bytes, which would overflow if the log is really big (>1 petabyte). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7870 Reviewed By: ajkr Differential Revision: D25938776 Pulled By: jay-zhuang fbshipit-source-id: 6ee9e21ebd65d297ea90fa1e7e74f3e1c533299d	2021-01-19 13:47:48 -08:00
darionyaphet	2fb6d9337f	Using emplace_back replace push_back (#7568 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7568 Reviewed By: akankshamahajan15 Differential Revision: D24437383 Pulled By: jay-zhuang fbshipit-source-id: 7c9b3c4944b959aa7796c53b410c2b1055dc5641	2021-01-15 16:56:41 -08:00
Jay Zhuang	77b4bfe511	Disable PeriodicWorkScheduler during RateLimited test (#7810 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7810 Reviewed By: akankshamahajan15 Differential Revision: D25695454 Pulled By: jay-zhuang fbshipit-source-id: 963d11f38a959de7227ba2be15795af2792413a6	2021-01-11 15:01:52 -08:00
Jay Zhuang	eccc47e81c	Fix tsan options_test (#7845 ) Summary: Minor tsan issue that counter could be bumped concurrently: https://app.circleci.com/pipelines/github/facebook/rocksdb/5431/workflows/79312c7c-5815-4f07-8836-94625db8e33e/jobs/81619 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7845 Reviewed By: akankshamahajan15 Differential Revision: D25851472 Pulled By: jay-zhuang fbshipit-source-id: 74cc8797ac503413bec27a30e5d1f055379777e8	2021-01-11 10:17:57 -08:00
Adam Retter	4926b33742	Improvements to Env::GetChildren (#7819 ) Summary: The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source. Also some minor fixes to `Env::GetChildren`: * Improve error handling in POSIX implementation * Remove unnecessary array allocation on Windows * Fix struct name for Windows Non-UTF-8 API Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819 Reviewed By: ajkr Differential Revision: D25837394 Pulled By: jay-zhuang fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e	2021-01-09 09:44:34 -08:00
Zhichao Cao	48c0843e69	Treat File Scope Write IO Error the same as Retryable IO Error (#7840 ) Summary: In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840 Test Plan: added new unit tests in error_handler_fs_test. make check Reviewed By: anand1976 Differential Revision: D25820481 Pulled By: zhichao-cao fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806	2021-01-07 16:31:33 -08:00
mrambacher	cc2a180d00	Add more tests to the ASC pass list (#7834 ) Summary: Fixed the following to now pass ASC checks: * `ttl_test` * `blob_db_test` * `backupable_db_test`, * `delete_scheduler_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7834 Reviewed By: jay-zhuang Differential Revision: D25795398 Pulled By: ajkr fbshipit-source-id: a10037817deda4fc7cbb353a2e00b62ed89b6476	2021-01-07 15:22:53 -08:00
Adam Retter	6e0f62f2b6	Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715 ) Summary: Third batch of adding more tests to ASSERT_STATUS_CHECKED. * db_compaction_filter_test * db_compaction_test * db_dynamic_level_test * db_inplace_update_test * db_sst_test * db_tailing_iter_test * db_io_failure_test Also update GetApproximateSizes APIs to all return Status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715 Reviewed By: jay-zhuang Differential Revision: D25806896 Pulled By: pdillinger fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262	2021-01-06 14:15:02 -08:00
Zhichao Cao	5792b73fdc	Fixed the swallowed IOStatus in Compaction Job introduced in PR 7718 (#7838 ) Summary: The IOStatus of TableBuilder is returned by copy the io status from builder->io_status(). pr https://github.com/facebook/rocksdb/issues/7718 swallowed the io status and it will cause the write IO error become non-retryable and no auto resume logic will handle it. Roll back to previous implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7838 Test Plan: make check Reviewed By: ajkr Differential Revision: D25795387 Pulled By: zhichao-cao fbshipit-source-id: bc35e69e0b71aa4148a6ed76f073357041b8e372	2021-01-06 13:18:00 -08:00
mrambacher	e628f59e87	Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703 ) Summary: This PR does the following: -> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems. -> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls -> Eliminates the public uses of the LegacyFileSystemWrapper. With this change in place, there are effectively the following patterns of Env: - "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system. - Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system - Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself. With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction. Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703 Reviewed By: zhichao-cao Differential Revision: D25762190 Pulled By: anand1976 fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b	2021-01-06 10:49:32 -08:00
mrambacher	c1a65a4de4	Make StringEnv, StringSink, StringSource use FS classes (#7786 ) Summary: Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes. Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786 Reviewed By: jay-zhuang Differential Revision: D25761460 Pulled By: anand1976 fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f	2021-01-04 16:01:01 -08:00
Andrew Kryczka	225abffd8f	Verify file checksum generator name (#7824 ) Summary: Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824 Reviewed By: zhichao-cao Differential Revision: D25740254 Pulled By: ajkr fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f	2021-01-04 11:51:50 -08:00
mrambacher	0bad2b4308	Ignore the OnAddFile Status for SSTFileManager (#7826 ) Summary: The returned Status is ignored here as some stress tests are failing, presumably when attempting to add an empty file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7826 Reviewed By: jay-zhuang Differential Revision: D25742931 fbshipit-source-id: a1fcd620d9472993a009929306dfc421f93eb43b	2021-01-04 11:08:28 -08:00
Zhichao Cao	44ebc24dca	Add rate_limiter to GenerateOneFileChecksum (#7811 ) Summary: In GenerateOneFileChecksum(), RocksDB reads the file and computes its checksum. A rate limiter can be passed to the constructor of RandomAccessFileReader so that read I/O can be rate limited. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7811 Test Plan: make check Reviewed By: cheng-chang Differential Revision: D25699896 Pulled By: zhichao-cao fbshipit-source-id: e2688bc1126c543979a3bcf91dda784bd7b74164	2020-12-26 22:07:24 -08:00
mrambacher	55e99688cc	No elide constructors (#7798 ) Summary: Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798 Reviewed By: jay-zhuang Differential Revision: D25680451 Pulled By: pdillinger fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917	2020-12-23 16:55:53 -08:00
Akanksha Mahajan	30a5ed9c53	Update "num_data_read" stat in RetrieveMultipleBlocks (#7770 ) Summary: RetrieveMultipleBlocks which is used by MultiGet to read data blocks is not updating num_data_read stat in GetContextStats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7770 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D25538982 Pulled By: akankshamahajan15 fbshipit-source-id: e3daedb035b1be8ab6af6f115cb3793ccc7b1ec6	2020-12-23 15:16:46 -08:00
Peter Dillinger	a727efca99	Remove flaky, redundant, and dubious DBTest.SparseMerge (#7800 ) Summary: This test would occasionally fail like this: WARNING: c:\users\circleci\project\db\db_test.cc(1343): error: Expected: (dbfull()->TEST_MaxNextLevelOverlappingBytes(handles_[1])) <= (20 * 1048576), actual: 33501540 vs 20971520 And being a super old test, it's not structured in a sound way. And it appears that DBTest2.MaxCompactionBytesTest is a better test of what SparseMerge was intended to test. In fact, SparseMerge fails if I set options.max_compaction_bytes = options.target_file_size_base * 1000; Thus, we are removing this negative-value test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7800 Test Plan: Q.E.D. Reviewed By: ajkr Differential Revision: D25693366 Pulled By: pdillinger fbshipit-source-id: 9da07d4dce0559547fc938b2163a2015e956c548	2020-12-23 11:08:12 -08:00
mrambacher	02418194d7	Add more tests for assert status checked (#7524 ) Summary: Added 10 more tests that pass the ASSERT_STATUS_CHECKED test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524 Reviewed By: akankshamahajan15 Differential Revision: D24323093 Pulled By: ajkr fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801	2020-12-22 23:45:58 -08:00
Adam Retter	81592d9ffa	Add more tests to ASSERT_STATUS_CHECKED (4) (#7718 ) Summary: Fourth batch of adding more tests to ASSERT_STATUS_CHECKED. * db_range_del_test * db_write_test * random_access_file_reader_test * merge_test * external_sst_file_test * write_buffer_manager_test * stringappend_test * deletefile_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7718 Reviewed By: pdillinger Differential Revision: D25671608 fbshipit-source-id: 687a794e98a9e0cd5428ead9898ef05ced987c31	2020-12-22 15:09:39 -08:00
cheng-chang	41ff125a8a	SyncWAL shouldn't be supported in compacted db (#7788 ) Summary: `CompactedDB` is a kind of read-only DB, so it shouldn't support `SyncWAL`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7788 Test Plan: watch existing tests to pass. Reviewed By: akankshamahajan15 Differential Revision: D25661209 Pulled By: cheng-chang fbshipit-source-id: 9eb2cc3f73736dcc205c8410e5944aa203f002d3	2020-12-22 14:53:43 -08:00
sdong	9057d0a079	Minimize Timing Issue in test WALTrashCleanupOnOpen (#7796 ) Summary: We saw DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen sometimes fail with: db/db_sst_test.cc:575: Failure Expected: (trash_log_count) >= (1), actual: 0 vs 1 The suspicious is that delete scheduling actually deleted all trash files based on rate, but it is not expected. This can be reproduced if we manually add sleep after DB is closed for serveral seconds. Minimize its chance by setting the delete rate to be lowest possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7796 Test Plan: The test doesn't fail with the manual sleeping anymore Reviewed By: anand1976 Differential Revision: D25675000 fbshipit-source-id: a39fd05e1a83719c41014e48843792e752368e22	2020-12-22 14:44:08 -08:00
Peter Dillinger	4d1ac19e3d	aggregated-table-properties with GetMapProperty (#7779 ) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3	2020-12-19 08:00:14 -08:00
Cheng Chang	fbce7a3808	Track WAL obsoletion when updating empty CF's log number (#7781 ) Summary: In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk. For example, consider there are 3 column families: 0, 1, 2: 1. initially, all the column families' log number is 1; 2. write some data to cf0, and flush cf0, but the flush is pending; 3. now a new WAL 2 is created; 4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number); 5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk; 6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2; 7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3; 8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist. When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption. This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781 Test Plan: watch existing tests and stress tests to pass. `make -j48 blackbox_crash_test` on devserver Reviewed By: ltamasi Differential Revision: D25631695 Pulled By: cheng-chang fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec	2020-12-18 21:34:36 -08:00
Burton Li	2021392e25	Do not full scan obsolete files on compaction busy (#7739 ) Summary: When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure. This caused performance issue when outstanding compactions reached the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739 Reviewed By: cheng-chang Differential Revision: D25508319 Pulled By: ltamasi fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719	2020-12-15 13:51:10 -08:00
Jay Zhuang	a0e4421e81	Log sst number in Corruption status (#7767 ) Summary: sst file number in corruption error would be very useful for debugging Pull Request resolved: https://github.com/facebook/rocksdb/pull/7767 Reviewed By: zhichao-cao Differential Revision: D25485872 Pulled By: jay-zhuang fbshipit-source-id: 67315b582cedeefbce6676015303ebe5bf6526a3	2020-12-14 14:07:52 -08:00
Levi Tamasi	1afbd1948c	Add initial blob support to batched MultiGet (#7766 ) Summary: The patch adds initial support for reading blobs to the batched `MultiGet` API. The current implementation simply retrieves the blob values as the blob indexes are encountered; that is, reads from blob files are currently not batched. (This will be optimized in a separate phase.) In addition, the patch removes some dead code related to BlobDB from the batched `MultiGet` implementation, namely the `is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` / `MemTableListVersion`. These were never hooked up to anything and wouldn't work anyways, since a single flag is not sufficient to communicate the "blobness" of multiple key-values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D25479290 Pulled By: ltamasi fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000	2020-12-14 13:48:22 -08:00
Peter Dillinger	b1ee191405	Fix memory leak for ColumnFamily drop with live iterator (#7749 ) Summary: Uncommon bug seen by ASAN with ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two references to a ColumnFamilyData are both SuperVersions (during InstallSuperVersion). The fix is to use UnrefAndTryDelete even in SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup on the same SuperVersion being cleaned up. ColumnFamilyData::Unref is considered unsafe so removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7749 Test Plan: ./column_family_test --gtest_filter=LiveIter --gtest_repeat=100 Reviewed By: jay-zhuang Differential Revision: D25354304 Pulled By: pdillinger fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c	2020-12-11 11:18:21 -08:00
Cheng Chang	fd7d8dc56e	Do not log unnecessary WAL obsoletion events (#7765 ) Summary: min_wal_number_to_keep should not be decreasing, if it does not increase, then there is no need to log the WAL obsoletions in MANIFEST since a previous one has been logged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7765 Test Plan: watch existing tests and stress tests to pass Reviewed By: pdillinger Differential Revision: D25462542 Pulled By: cheng-chang fbshipit-source-id: 0085fcb6edf5cf2b0fc32f9932a7566f508768ff	2020-12-10 12:55:49 -08:00
Adam Retter	8ff6557e7f	Add further tests to ASSERT_STATUS_CHECKED (2) (#7698 ) Summary: Second batch of adding more tests to ASSERT_STATUS_CHECKED. * external_sst_file_basic_test * checkpoint_test * db_wal_test * db_block_cache_test * db_logical_block_size_cache_test * db_blob_index_test * optimistic_transaction_test * transaction_test * point_lock_manager_test * write_prepared_transaction_test * write_unprepared_transaction_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698 Reviewed By: cheng-chang Differential Revision: D25441664 Pulled By: pdillinger fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00	2020-12-09 21:21:16 -08:00
Cheng Chang	80159f6e0b	Carry over min_log_number_to_keep_2pc in new MANIFEST (#7747 ) Summary: When two phase commit is enabled, `VersionSet::min_log_number_to_keep_2pc` is set during flush. But when a new MANIFEST is created, the `min_log_number_to_keep_2pc` is not carried over to the new MANIFEST. So if a new MANIFEST is created and then DB is reopened, the `min_log_number_to_keep_2pc` will be lost. This may cause DB recovery errors. The bug is reproduced in a new unit test in `version_set_test.cc`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7747 Test Plan: The new unit test in `version_set_test.cc` should pass. Reviewed By: jay-zhuang Differential Revision: D25350661 Pulled By: cheng-chang fbshipit-source-id: eee890d5b19f15769069670692e270ae31044ece	2020-12-09 19:07:25 -08:00
anand76	8a1488efbf	Ensure that MultiGet works properly with compressed cache (#7756 ) Summary: Ensure that when direct IO is enabled and a compressed block cache is configured, MultiGet inserts compressed data blocks into the compressed block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756 Test Plan: Add unit test to db_basic_test Reviewed By: cheng-chang Differential Revision: D25416240 Pulled By: anand1976 fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f	2020-12-09 17:01:13 -08:00
Cheng Chang	3c2a448856	Add a test for disabling tracking WAL (#7757 ) Summary: If WAL tracking was enabled, then disabled during reopen, the previously tracked WALs should be removed from MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7757 Test Plan: a new unit test `DBBasicTest.DisableTrackWal` is added. Reviewed By: jay-zhuang Differential Revision: D25410508 Pulled By: cheng-chang fbshipit-source-id: 9d8d9e665066135930a7c1035bb8c2f68bded6a0	2020-12-09 16:58:26 -08:00
Cheng Chang	efe827baf0	Always track WAL obsoletion (#7759 ) Summary: Currently, when a WAL becomes obsolete after flushing, if VersionSet::WalSet does not contain the WAL, we do not track the WAL obsoletion event in MANIFEST. But consider this case: * WAL 10 is synced, a VersionEdit is LogAndApplied to MANIFEST to log this WAL addition event, but the VersionEdit is not applied to WalSet yet since its corresponding ManifestWriter is still pending in the write queue; * Since the above ManifestWriter is blocking, the LogAndApply will block on a conditional variable and release the db mutex, so another LogAndApply can proceed to enqueue other VersionEdits concurrently; * Now flush happens, and WAL 10 becomes obsolete, although WalSet does not contain WAL 10 yet, we should call LogAndApply to enqueue a VersionEdit to indicate the obsoletion of WAL 10; * otherwise, when the queued edit indicating WAL 10 addition is logged to MANIFEST, and DB crashes and reopens, the WAL 10 might have been removed from disk, but it still exists in MANIFEST. This PR changes the behavior to: always `LogAndApply` any WAL addition or obsoletion event, without considering the order issues caused by concurrency, but when applying the edits to `WalSet`, do not add the WALs if they are already obsolete. In this approach, the logical events of WAL addition and obsoletion are always tracked in MANIFEST, so we can inspect the MANIFEST and know all the previous WAL events, but we choose to ignore certain events due to the concurrency issues such as the case above, or the case in https://github.com/facebook/rocksdb/pull/7725. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7759 Test Plan: make check Reviewed By: pdillinger Differential Revision: D25423089 Pulled By: cheng-chang fbshipit-source-id: 9cb9a7fbc1875bf954f2a42f9b6cfd6d49a7b21c	2020-12-09 16:02:12 -08:00
Adam Retter	7b2216c906	Add further tests to ASSERT_STATUS_CHECKED (1) (#7679 ) Summary: First batch of adding more tests to ASSERT_STATUS_CHECKED. * db_iterator_test * db_memtable_test * db_merge_operator_test * db_merge_operand_test * write_batch_test * write_batch_with_index_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7679 Reviewed By: ajkr Differential Revision: D25399270 Pulled By: pdillinger fbshipit-source-id: 3017d0a686aec5cd2d743fc2acbbf75df239f3ba	2020-12-08 15:55:04 -08:00
Cheng Chang	07030c6f4a	Do not track obsolete WALs in MANIFEST even if they are synced (#7725 ) Summary: Consider the case: 1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3; 2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2; 3. DB crashes; 4. During DB recovery, when replaying MANIFEST, we first see that WAL with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2; 5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`. The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`. The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725 Test Plan: 1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added. 2. run `make crash_test` on devserver. Reviewed By: riversand963 Differential Revision: D25238914 Pulled By: cheng-chang fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72	2020-12-08 10:58:04 -08:00

1 2 3 4 5 ...

4337 Commits