Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.
`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:
>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>
During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D25296217
Pulled By: riversand963
fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
Summary:
In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.
Remove the assert and added the comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744
Test Plan: make check and pass the stress test
Reviewed By: riversand963
Differential Revision: D25340573
Pulled By: zhichao-cao
fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570
Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.
Reviewed By: riversand963
Differential Revision: D24394222
Pulled By: cheng-chang
fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693
Test Plan: make check, add new testing cases to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D25066204
Pulled By: zhichao-cao
fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D25015421
Pulled By: jay-zhuang
fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.
Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.
```
assert(cfd->imm()->NumNotFlushed() != 0);
assert(cfd->imm()->IsFlushPending());
```
Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362
Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```
Reviewed By: ajkr
Differential Revision: D25172996
Pulled By: riversand963
fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601
Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.
Reviewed By: ltamasi
Differential Revision: D24553957
Pulled By: cheng-chang
fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.
There will be a separate follow-up patch to perform blob garbage collection
during compactions.
In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24404696
Pulled By: ltamasi
fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
Test Plan: Tested using `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D23506369
Pulled By: ltamasi
fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
object in WritableFileWriter.
This new object wraps FSWritableFile pointer.
Objective: If tracing is enabled, FSWritableFile Ptr returns
FSWritableFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSWritableFilePtr wrapper class is added to bypass the
FSWritableFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193
Reviewed By: anand1976
Differential Revision: D23355915
Pulled By: akankshamahajan15
fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check.
When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261
Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.
Reviewed By: ajkr
Differential Revision: D23142269
fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938
Reviewed By: siying
Differential Revision: D23153235
Pulled By: ajkr
fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.
This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.
Fixes https://github.com/facebook/rocksdb/issues/6432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
Test Plan:
crash test command that repro'd the bug reliably:
```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```
Reviewed By: ltamasi
Differential Revision: D23090800
Pulled By: ajkr
fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
Summary:
As part of the IOTracing project, this PR
1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
2. FileSystemPtr object is created using FileSystem pointer and IOTracer
pointer.
3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
returns FileSystemTracingWrapper pointer for tracing purpose and when
it is disabled underlying FileSystem pointer is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D22987117
Pulled By: akankshamahajan15
fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.
We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:
1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
failure without process restart, e.g. calling Resume().
Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034
Reviewed By: zhichao-cao
Differential Revision: D22247137
Pulled By: riversand963
fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
Test Plan: make check and some manual tests.
Reviewed By: zhichao-cao
Differential Revision: D22048826
Pulled By: gg814
fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
Summary:
As title. The prior change to the line is a typo. Fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D21873587
Pulled By: riversand963
fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430
Summary:
Current impl. of db_stress will abort verification and report failure if
GetLiveFiles() causes a dropped column family to be flushed. This is not
desired.
To fix, this PR makes the following change:
In GetLiveFiles, if flush is triggered and returns
Status::IsColumnFamilyDropped(), then set status to Status::OK().
This is OK because dropped column families will be skipped during the rest of
this function, and valid column families will have their live files returned to
caller.
Test plan (dev server):
make check
./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805
Reviewed By: ltamasi
Differential Revision: D21390044
Pulled By: riversand963
fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti. Replace with static_cast_with_check.
Signed-off-by: Derrick Pallas <derrick@pallas.us>
Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732
Reviewed By: ltamasi
Differential Revision: D21304260
Pulled By: pdillinger
fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
max_background_flushes dynamically using SetDBOptions.
TestPlan: 1. make -j64 check
2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
Reviewed By: ajkr
Differential Revision: D21028010
Pulled By: akankshamahajan15
fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
Summary:
The raw key bytes are currently dumped directly into the log messages,
which is not ideal if the keys aren't ASCII strings. Null bytes in
particular can cut off bits of the message early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616
Reviewed By: ajkr
Differential Revision: D20879218
Pulled By: anand1976
fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Summary:
In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593
Test Plan: Add a unit test. See all existing tests pass.
Reviewed By: ajkr
Differential Revision: D20637408
fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482
Test Plan:
A new test ManualCompactionTest::SkipLevel is added.
Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.
Also updated db_compaction_test and db_test to use correct range for full compaction.
Differential Revision: D20251149
Pulled By: cheng-chang
fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
Summary:
this silences following warning from clang-11
```
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const
std::pair<int\
, rocksdb::FileMetaData>' [-Wrange-loop-analysis]
for (const auto newf : c->edit()->GetNewFiles()) {
^
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying
for (const auto newf : c->edit()->GetNewFiles()) {
^~~~~~~~~~~~~~~~~
&
```
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477
Differential Revision: D20211850
Pulled By: ltamasi
fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468
Test Plan: pass make asan_check
Differential Revision: D20195261
Pulled By: zhichao-cao
fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case.
If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already.
Test plan (dev server):
```
./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults
make check
```
Both must succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385
Differential Revision: D19782943
Pulled By: riversand963
fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
Then I found "delete sv" in ~SuperVersion() takes the time.
The backtrace looks like this
DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
I think it's better to delete in a background thread, please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
Differential Revision: D18972066
fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.
This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.
fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
Differential Revision: D18926378
fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
Summary:
**Summary:**
This PR fixes two unordered_write related issues:
- ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
- compact range may cause memtable is flushed before pending unordered write finished
1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes
2. there are some pending writes but memtable is already flushed
3. the memtable related WAL is removed( note that the pending-writes were recorded in that WAL).
4. pending-writes write to newer created memtable
5. there is a restart
6. missing the previous pending-writes because WAL is removed but they aren't included in SST.
**How to solve:**
- Wait pending memtable writes before ingestion job check memtable key range
- Wait pending memtable writes before flush memtable.
**Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness.**
**Test Plan:**
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113
Differential Revision: D18895674
Pulled By: maysamyabandeh
fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
Summary:
RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg
thread is scheduled. Before this fix, the counter is decremented only when the
bg thread starts and picks an element from the flush queue. This may cause more
than necessary bg threads to be scheduled. Not a correctness issue, but may
affect flush thread count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104
Test Plan:
```
make check
```
Differential Revision: D18735584
Pulled By: riversand963
fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090
Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.
Differential Revision: D18702268
fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb