rocksdb

Author	SHA1	Message	Date
anand76	832b056a30	Enable IO timeouts for iterators (#7161 ) Summary: Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for Iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161 Test Plan: New unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22687352 Pulled By: anand1976 fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec	2020-08-07 12:01:08 -07:00
Zhichao Cao	b79f13b2aa	Fix the potential deadlock in WriteImplWALOnly and UnorderedWriteMemtable (#7199 ) Summary: Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly. Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199 Test Plan: pass make check Reviewed By: anand1976 Differential Revision: D22961714 Pulled By: zhichao-cao fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8	2020-08-07 11:28:49 -07:00
Yingchun Lai	67bbac3621	Remove duplicate colon in Status message (#7041 ) Summary: A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2), it's not needed to append a colon explicitly to 'msg'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041 Reviewed By: ajkr Differential Revision: D22292801 fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec	2020-08-06 15:18:04 -07:00
Akanksha Mahajan	493f425e77	Add support to start and end IOTracing through DB APIs (#7203 ) Summary: 1. Add support to start io tracing through DB::StartIOTrace(Env, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open. User side code: //Open DB DB::Open(options, dbname, &db); / Start tracing / db->StartIOTrace(env, trace_opt, std::move(trace_writer)); / Perform Operations / /End tracing*/ db->EndIOTrace(); 2. Fix the build errors for Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D22901947 Pulled By: akankshamahajan15 fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c	2020-08-04 18:41:45 -07:00
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	2020-08-03 15:23:04 -07:00
mrambacher	d9d190742c	Make env_test work with ASSERT_STATUS_CHECKED (#7176 ) Summary: Make (most of) the env_test pass when ASSERT_STATUS_CHECKED is enabled. One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176 Reviewed By: cheng-chang Differential Revision: D22799278 Pulled By: ajkr fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed	2020-07-28 22:59:48 -07:00
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	2020-07-22 18:32:52 -07:00
sdong	9870704420	Fix a minor data race in stats dumping threads initialization (#7151 ) Summary: https://github.com/facebook/rocksdb/pull/7145 creates a minor data race against the stat creation counter. Turn it to atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7151 Test Plan: Run the test. Reviewed By: ajkr Differential Revision: D22631014 fbshipit-source-id: c6fb69ac5b9df7139795dacea5ce9fb9fd3278d7	2020-07-20 12:12:43 -07:00
Andrew Kryczka	9a83fd21e6	stagger first DumpMallocStats after opening DB (#7145 ) Summary: Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145 Reviewed By: riversand963 Differential Revision: D22593031 Pulled By: ajkr fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6	2020-07-17 16:13:26 -07:00
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	2020-07-15 11:03:58 -07:00
wenh	4924a506b9	Reduce `env_->GetChildren()` calls in DBImpl::Recover() (#7044 ) Summary: There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems. This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044 Test Plan: Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching. (WIP) Reviewed By: riversand963 Differential Revision: D22419925 Pulled By: roghnin fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886	2020-07-10 13:41:08 -07:00
mrambacher	c7c7b07f06	More Makefile Cleanup (#7097 ) Summary: Cleans up some of the dependencies on test code in the Makefile while building tools: - Moves the test::RandomString, DBBaseTest::RandomString into Random - Moves the test::RandomHumanReadableString into Random - Moves the DestroyDir method into file_utils - Moves the SetupSyncPointsToMockDirectIO into sync_point. - Moves the FaultInjection Env and FS classes under env These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated. Tested both release and debug builds via Make and CMake for both static and shared libraries. More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097 Reviewed By: riversand963 Differential Revision: D22463160 Pulled By: pdillinger fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2	2020-07-09 14:35:17 -07:00
Jay Zhuang	00de699096	Replace reinterpret_cast with static_cast_with_check (#7067 ) Summary: Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067 Reviewed By: siying Differential Revision: D22361587 Pulled By: jay-zhuang fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1	2020-07-02 19:25:41 -07:00
Zitan Chen	373d5ac485	BackupEngine verifies table file checksums on creating new backups (#7015 ) Summary: When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory. After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process. No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function. In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files. Corresponding tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22165732 Pulled By: gg814 fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec	2020-07-02 18:15:12 -07:00
Peter Dillinger	52d59e0c93	Revert "Whole DBTest to skip fsync (#7049 )" (#7070 ) Summary: This reverts commit `4f1534bdb0`. This commit caused failures and deadlocks in MultiThreadedDBTest.MultiThreaded/69 and others. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070 Reviewed By: riversand963 Differential Revision: D22358778 Pulled By: pdillinger fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac	2020-07-02 10:22:43 -07:00
sdong	4f1534bdb0	Whole DBTest to skip fsync (#7049 ) Summary: After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049 Test Plan: Run all existing files. Reviewed By: pdillinger Differential Revision: D22301700 fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565	2020-07-01 19:37:56 -07:00
sdong	80b107a0a9	Divide WriteCallbackTest.WriteWithCallbackTest (#7037 ) Summary: WriteCallbackTest.WriteWithCallbackTest has a deep for-loop and in some cases runs very long. Parameterimized it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7037 Test Plan: Run the test and see it passes. Reviewed By: ltamasi Differential Revision: D22269259 fbshipit-source-id: a1b6687b5bf4609754833d14cf383d68bc7ab27a	2020-06-30 12:31:30 -07:00
Anand Ananthabhotla	9a5886bd8c	Extend Get/MultiGet deadline support to table open (#6982 ) Summary: Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open. The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future. Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982 Test Plan: Add new unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22219515 Pulled By: anand1976 fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b	2020-06-29 14:53:17 -07:00
Yanqin Jin	d47c871190	Fix data race to VersionSet::io_status_ (#7034 ) Summary: After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple threads without lock, causing tsan test to fail. For example, a bg flush thread resets io_status_ before calling LogAndApply(), while another thread already in the process of LogAndApply() reads io_status_. This is a bug. We do not have to reset io_status_ each time we call LogAndApply(). io_status_ is part of the state of VersionSet, and it indicates the outcome of preceding MANIFEST/CURRENT files IO operations. Its value should be updated only when: 1. MANIFEST/CURRENT files IO fail for the first time. 2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior failure without process restart, e.g. calling Resume(). Test Plan (devserver): COMPILE_WITH_TSAN=1 make check COMPILE_WITH_TSAN=1 make db_test2 ./db_test2 --gtest_filter=DBTest2.CompactionStall Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034 Reviewed By: zhichao-cao Differential Revision: D22247137 Pulled By: riversand963 fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216	2020-06-27 08:57:31 -07:00
Zitan Chen	be41c61f22	Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997 ) Summary: `BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`. Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true. Three new test cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997 Test Plan: Passed make check. Reviewed By: ajkr Differential Revision: D22098895 Pulled By: gg814 fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76	2020-06-24 19:31:25 -07:00
Yanqin Jin	e66199d848	First step towards handling MANIFEST write error (#6949 ) Summary: This PR provides preliminary support for handling IO error during MANIFEST write. File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted. One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach. If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled. Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled. Possible future directions: - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added. Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949 Reviewed By: anand1976 Differential Revision: D22026020 Pulled By: riversand963 fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8	2020-06-24 19:07:08 -07:00
sdong	d6b7b7712f	Fix a bug that causes iterator to return wrong result in a rare data race (#6973 ) Summary: The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong. Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973 Test Plan: Will run stress tests for a while to make sure there is no general regression. Reviewed By: ajkr Differential Revision: D22029348 fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1	2020-06-18 10:16:38 -07:00
Zitan Chen	94d04529de	Store DB identity and DB session ID in SST files (#6983 ) Summary: `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file. The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`. In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. A table property test is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983 Test Plan: make check and some manual tests. Reviewed By: zhichao-cao Differential Revision: D22048826 Pulled By: gg814 fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9	2020-06-17 10:57:40 -07:00
Yanqin Jin	9bfd46d0d8	Let best-efforts recovery ignore CURRENT file (#6970 ) Summary: Best-efforts recovery does not check the content of CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on CURRENT file at all. Test plan (dev server): make check ./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970 Reviewed By: anand1976 Differential Revision: D22013990 Pulled By: riversand963 fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f	2020-06-15 14:11:24 -07:00
Zitan Chen	88db97b06d	Add a DB Session ID (#6959 ) Summary: Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity. The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”. A test for the uniqueness, for different openings and during the same opening, is also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959 Test Plan: Passed make check Reviewed By: zhichao-cao Differential Revision: D21951721 Pulled By: gg814 fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b	2020-06-15 10:47:02 -07:00
Yanqin Jin	717749f4c0	Fail point-in-time WAL recovery upon IOError reading WAL (#6963 ) Summary: If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset. Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following. - WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery. - IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963 Reviewed By: ajkr Differential Revision: D22011083 Pulled By: riversand963 fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536	2020-06-11 18:42:10 -07:00
Zhichao Cao	b3585a11b4	Ingest SST files with checksum information (#6891 ) Summary: Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum. 1. If generate_and_verify_file_checksum options is FALSE: 1) if DB does not enable SST file checksum, the checksum information ingested will be ignored; 2) if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call. 2. If generate_and_verify_file_checksum options is TRUE: 1) if DB does not enable SST file checksum, the checksum information ingested will be ignored; 2) if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891 Test Plan: added unit test, pass make asan_check Reviewed By: pdillinger Differential Revision: D21935988 Pulled By: zhichao-cao fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed	2020-06-11 14:27:36 -07:00
Akanksha Mahajan	2677bd5967	Add logs and stats in DeleteScheduler (#6927 ) Summary: Add logs and stats for files marked as trash and files deleted immediately in DeleteScheduler Pull Request resolved: https://github.com/facebook/rocksdb/pull/6927 Test Plan: make check -j64 Reviewed By: riversand963 Differential Revision: D21869068 Pulled By: akankshamahajan15 fbshipit-source-id: e9f673c4fa8049ce648b23c75d742f2f9c6c57a1	2020-06-05 09:43:04 -07:00
Yanqin Jin	2f3261831b	Fix a typo (bug) when setting error during Flush (#6928 ) Summary: As title. The prior change to the line is a typo. Fixing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D21873587 Pulled By: riversand963 fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430	2020-06-04 08:30:42 -07:00
Zitan Chen	02df00d97b	API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900 ) Summary: DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true. This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly. Two tests for this updated behavior of DB::OpenForReadOnly are also added. Other minor changes: 1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly; 2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing]; 3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900 Test Plan: passed make check; also manually tested a few ldb subcommands Reviewed By: pdillinger Differential Revision: D21822188 Pulled By: gg814 fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e	2020-06-03 18:57:49 -07:00
Yanqin Jin	961c7590d6	Add timestamp to delete (#6253 ) Summary: Preliminary user-timestamp support for delete. If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100. Implementation A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp. The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253 Reviewed By: ltamasi Differential Revision: D20995328 Pulled By: riversand963 fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619	2020-05-28 10:40:03 -07:00
Akanksha Mahajan	bcefc59e9f	Allow MultiGet users to limit cumulative value size (#6826 ) Summary: 1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key. 2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826 Test Plan: 1. make check -j64 2. Add a new unit test case Reviewed By: anand1976 Differential Revision: D21471483 Pulled By: akankshamahajan15 fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900	2020-05-27 13:07:14 -07:00
Yanqin Jin	e72e2167fd	Fix a few bugs in best-efforts recovery (#6824 ) Summary: 1. Update column_family_memtables_ to point to latest column_family_set in version_set after recovery. 2. Normalize file paths passed by application so that directories end with '/' or '\\'. 3. In addition to missing files, corrupted files are also ignored in best-efforts recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824 Test Plan: COMPILE_WITH_ASAN=1 make check Reviewed By: anand1976 Differential Revision: D21463905 Pulled By: riversand963 fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2	2020-05-08 13:01:42 -07:00
Levi Tamasi	ac3ae1df0b	Find/purge obsolete blob files (#6807 ) Summary: The patch extends `FindObsoleteFiles` and `PurgeObsoleteFiles` with support for blob files. The behavior is analogous to SST files: obsolete blob files are put on the "candidates for deletion" list, while live (and pending) files are preserved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6807 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21406249 Pulled By: ltamasi fbshipit-source-id: 1948f71c31927564b61e8af394f50ca3964880d9	2020-05-07 09:32:51 -07:00
Yanqin Jin	5584595f80	Do not swallow error returned from SaveTo() (#6801 ) Summary: With consistency check enabled, VersionBuilder::SaveTo() may return error once corruption is detected while building versions. We should handle these errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6801 Test Plan: make check Reviewed By: siying Differential Revision: D21385045 Pulled By: riversand963 fbshipit-source-id: 98f6424e2a4699b62befa21e9fe00e70a771118e	2020-05-05 10:46:20 -07:00
Yanqin Jin	5a61e7864d	Fix db_stress when GetLiveFiles() flushes dropped CF (#6805 ) Summary: Current impl. of db_stress will abort verification and report failure if GetLiveFiles() causes a dropped column family to be flushed. This is not desired. To fix, this PR makes the following change: In GetLiveFiles, if flush is triggered and returns Status::IsColumnFamilyDropped(), then set status to Status::OK(). This is OK because dropped column families will be skipped during the rest of this function, and valid column families will have their live files returned to caller. Test plan (dev server): make check ./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100 ./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805 Reviewed By: ltamasi Differential Revision: D21390044 Pulled By: riversand963 fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9	2020-05-04 17:45:49 -07:00
Levi Tamasi	a00ddf1574	Expose the set of live blob files from Version/VersionSet (#6785 ) Summary: The patch adds logic that returns the set of live blob files from `Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to live table files), and also cleans up the code a bit, for example, by exposing only the numbers of table files as opposed to the earlier `FileDescriptor`s that no clients used. Moreover, the patch extends the `GetLiveFiles` API so that it also exposes blob files in the current version. Similarly to https://github.com/facebook/rocksdb/pull/6755, this is a building block for identifying and purging obsolete blob files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21336210 Pulled By: ltamasi fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821	2020-05-04 15:08:13 -07:00
anand76	ab13d43e1d	Pass a timeout to FileSystem for random reads (#6751 ) Summary: Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded. For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR. Tests: Update existing unit tests to verify the correct timeout value is being passed Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751 Reviewed By: riversand963 Differential Revision: D21285631 Pulled By: anand1976 fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567	2020-04-30 14:50:39 -07:00
Levi Tamasi	fe238e5438	Keep track of obsolete blob files in VersionSet (#6755 ) Summary: The patch adds logic to keep track of obsolete blob files. A blob file becomes obsolete when the last `shared_ptr` that points to the corresponding `SharedBlobFileMetaData` object goes away, which, in turn, happens when the last `Version` that contains the blob file is destroyed. No longer needed blob files are added to the obsolete list in `VersionSet` using a custom deleter to avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`. Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored in `JobContext`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21233155 Pulled By: ltamasi fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a	2020-04-30 11:25:51 -07:00
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	2020-04-29 13:09:23 -07:00
Yanqin Jin	d4398e08fc	Fix timestamp support for MultiGet (#6748 ) Summary: 1. Avoid nullptr dereference when passing timestamp to KeyContext creation. 2. Construct LookupKey correctly with timestamp when creating MultiGetContext. 3. Compare without timestamp when sorting KeyContexts. Fixes https://github.com/facebook/rocksdb/issues/6745 Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748 Reviewed By: pdillinger Differential Revision: D21258691 Pulled By: riversand963 fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004	2020-04-27 22:49:56 -07:00
Yanqin Jin	e04f3bce4f	Update CURRENT file after best-efforts recovery (#6746 ) Summary: After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746 Test Plan: make check Reviewed By: anand1976 Differential Revision: D21189876 Pulled By: riversand963 fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b	2020-04-23 16:21:09 -07:00
anand76	c1ccd6b6af	Implement deadline support for MultiGet (#6710 ) Summary: Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This implementation enforces the deadline in DBImpl, which is fairly high level. Its best effort and may not check the deadline after every key lookup, but may do so after a batch of keys. In subsequent stages, we will extend this to passing a timeout down to the FileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710 Test Plan: Add new unit tests Reviewed By: riversand963 Differential Revision: D21149158 Pulled By: anand1976 fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e	2020-04-21 14:51:51 -07:00
Akanksha Mahajan	03a1d95db0	Set max_background_flushes dynamically (#6701 ) Summary: 1. Add changes so that max_background_flushes can be set dynamically. 2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the max_background_flushes dynamically using SetDBOptions. TestPlan: 1. make -j64 check 2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701 Reviewed By: ajkr Differential Revision: D21028010 Pulled By: akankshamahajan15 fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1	2020-04-20 16:19:02 -07:00
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	2020-04-15 17:40:44 -07:00
Steven Fackler	f53cdab3d7	Hex encode keys in compaction flush logs (#6616 ) Summary: The raw key bytes are currently dumped directly into the log messages, which is not ideal if the keys aren't ASCII strings. Null bytes in particular can cut off bits of the message early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616 Reviewed By: ajkr Differential Revision: D20879218 Pulled By: anand1976 fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7	2020-04-06 17:41:45 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Cheng Chang	ee50b8d499	Be able to decrease background thread's CPU priority when creating database backup (#6602 ) Summary: When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries. This PR makes it possible to decrease CPU priority of these threads when creating a new backup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602 Test Plan: make check Reviewed By: siying, zhichao-cao Differential Revision: D20683216 Pulled By: cheng-chang fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84	2020-03-28 19:07:25 -07:00
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
sdong	6fd0ed4993	CompactRange() to use bottom pool when goes to bottommost level (#6593 ) Summary: In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593 Test Plan: Add a unit test. See all existing tests pass. Reviewed By: ajkr Differential Revision: D20637408 fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3	2020-03-24 20:24:32 -07:00

1 2 3 4

167 Commits