Summary:
As previously coded, a Configurable extension would need access to code not in the public API. This change moves RegisterOptions into the Configurable class and therefore available to public extensions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223
Reviewed By: anand1976
Differential Revision: D27960188
Pulled By: mrambacher
fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40
Summary:
The block_based_table_builder buffers some blocks in memory to construct a good compression dictionary. Before this commit, the keys from each block were buffered separately for convenience. However, the buffered block data implicitly contains all keys. This commit eliminates the redundant key buffers and reduces memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8219
Reviewed By: ajkr
Differential Revision: D27945851
Pulled By: saketh-are
fbshipit-source-id: caf3cac1217201e080a1e24b542bedf20973afee
Summary:
Add new C APIs to create the JemallocNodumpAllocator and set it on a Cache object.
`make test` passes with and without `DISABLE_JEMALLOC=1`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8178
Reviewed By: jay-zhuang
Differential Revision: D27944631
Pulled By: ajkr
fbshipit-source-id: 2531729aa285a8985c58f22f093c4d53029c4a7b
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
Add compaction API for secondary instance, which compact the files to a secondary DB path without installing to the LSM tree.
The API will be used to remote compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8171
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D27694545
Pulled By: jay-zhuang
fbshipit-source-id: 8ff3ec1bffdb2e1becee994918850c8902caf731
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.
Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.
Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
Test Plan: added unit test, make crash_test
Reviewed By: riversand963
Differential Revision: D27898839
Pulled By: zhichao-cao
fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
Summary:
For some compilers/environments (e.g. Clang, riscv64), we need to link against -latomic. Check if this is a requirement and add the library to the third-party libs if it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8183
Reviewed By: pdillinger
Differential Revision: D27773564
Pulled By: mrambacher
fbshipit-source-id: 68e15d823144f83fb02221c7bf5b1e43323419bf
Summary:
RocksDB allows user-specified custom comparators which may not be known to `ldb`,
a built-in tool for checking/mutating the database. Therefore, column family comparator
names mismatch encountered during manifest dump should not prevent the dumping from
proceeding.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8216
Test Plan:
```
make check
```
Also manually do the following
```
KEEP_DB=1 ./db_with_timestamp_basic_test
./ldb --db=<db> manifest_dump --verbose
```
The ldb should succeed and print something like:
```
...
--------------- Column family "default" (ID 0) --------------
log number: 6
comparator: <TestComparator>, but the comparator object is not available.
...
```
Reviewed By: ltamasi
Differential Revision: D27927581
Pulled By: riversand963
fbshipit-source-id: f610b2c842187d17f575362070209ee6b74ec6d4
Summary:
Add comment to DisableManualCompaction() which was missing.
Also explictly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186
Test Plan: Run existing unit tests.
Reviewed By: jay-zhuang
Differential Revision: D27744517
fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.
Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.
DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.
If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.
End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.
WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).
DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898
Test Plan: Added a new test db/db_write_buffer_manager_test.cc
Reviewed By: anand1976
Differential Revision: D26093227
Pulled By: akankshamahajan15
fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
Summary:
This partially reverts commit 10196d7edc.
The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.
We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212
Test Plan: CI
Reviewed By: jay-zhuang
Differential Revision: D27896468
Pulled By: pdillinger
fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208
Make include of "file_system.h" use the same include path as everywhere
else.
Reviewed By: riversand963, akankshamahajan15
Differential Revision: D27881606
fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6245.
Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205.
Previously we were writing the ingested file's smallest/largest internal keys
with sequence number zero, or `kMaxSequenceNumber` in case of range
tombstone. The former (sequence number zero) is incorrect and can lead
to files being incorrectly ordered. The fix in this PR is to overwrite
boundary keys that have sequence number zero with the ingested file's assigned
sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209
Test Plan: repro unit test
Reviewed By: riversand963
Differential Revision: D27885678
Pulled By: ajkr
fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a
Summary:
Unittest reports no space from time to time, which can be reproduced on a small memory machine with SHM. It's caused by large WAL files generated during the test, which is preallocated, but didn't truncate during close(). Adding the missing APIs to set preallocation.
It added arm test as nightly build, as the test runs more than 1 hour.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204
Test Plan: test on small memory arm machine
Reviewed By: mrambacher
Differential Revision: D27873145
Pulled By: jay-zhuang
fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary:
Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
`rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
which is not safe without synchronization. This resulted in data races with
memtable inserts. The patch changes the code handling these properties
to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
cached value backed by an atomic variable. Two test cases had to be updated
for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
increasing the value size used so each value ends up in its own memtable,
which was the original intention (note: the test has been broken in the sense
that the test code didn't consider that memtable sizes below 64 KB get
increased to 64 KB by `SanitizeOptions`, and has been passing only by
accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
directly instead of going through the DB properties. Note: this should be safe in this case
since there's only a single thread involved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D27866811
Pulled By: ltamasi
fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
Summary:
If `options.best_efforts_recovery == true`, RocksDB currently tolerates missing table files and recovers to the latest version without missing table files (not considering WAL). It is necessary to handle blob files as well to make the feature more complete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8180
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D27840556
Pulled By: riversand963
fbshipit-source-id: 041685d0dc2e7779ac4f0374c07a8a327704aa5e
Summary:
Test was flaky because for kUseDbSessionId naming, blob files use
naming scheme kLegacyCrc32cAndFileSize. So expected number of files
because of collision can vary. So disabling blobdb for this test case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197
Reviewed By: pdillinger
Differential Revision: D27836997
Pulled By: akankshamahajan15
fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).
This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198
Test Plan: unit test added, crash test support
Reviewed By: jay-zhuang
Differential Revision: D27831232
Pulled By: pdillinger
fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
Summary:
The code for strcmp that was present does work when compiled for Windows unicode file paths.
Needs backporting to:
* 6.17.fb
* 6.18.fb
* 6.19.fb
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8190
Reviewed By: akankshamahajan15
Differential Revision: D27765588
Pulled By: jay-zhuang
fbshipit-source-id: 89f8a5ac61fd7edc758340dfd335b0a5f96dae6e
Summary:
- Fixes the makefile to do the right thing when invoking multiple targets (e.g. make shared_lib install-shared).
- Fixes the building of db_stress in shared lib mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8195
Reviewed By: pdillinger
Differential Revision: D27803452
Pulled By: mrambacher
fbshipit-source-id: 7c285d267770a359eb47f25855affdf58687e0e4
Summary:
Added the Blob option settings from the AdvancedColmnFamilyOptions to the C API.
There are no tests for getting/setting options in the C API currently, hence no specific test plans. Should there be a some?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148
Reviewed By: ltamasi
Differential Revision: D27568495
Pulled By: mrambacher
fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6
Summary:
Updated the test to wait until all trash files are deleted by
SSTFileManager in the background. Since deletion runs in background so
number of files deleted might not always be as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8196
Reviewed By: jay-zhuang
Differential Revision: D27812273
Pulled By: akankshamahajan15
fbshipit-source-id: d3ace1db34f91254b52fa455e09844d02801f58e
Summary:
Extend the DB::GetLiveFilesChecksumInfo API to blob files.
This API is also used by the file_checksum_dump ldb command to dump checksum
of SST files which now also dumps blob files checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179
Test Plan: Add new unit test
Reviewed By: zhichao-cao
Differential Revision: D27714965
Pulled By: akankshamahajan15
fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92
Summary:
As the name of `DBImpl::WriteLevel0TableForRecovery` suggests, the resulting table file
should be placed on L0. However, the argument `level` passed to `BuildTable()` is -1.
We need to correct this since the level information will be useful to determine file placement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8187
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D27748570
Pulled By: riversand963
fbshipit-source-id: e1cd23128a8de31f14b1edc2ea92754c154e4f10
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014
- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172
Test Plan:
```bash
$ make check -j$(nproc)
Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```
Reviewed By: ajkr
Differential Revision: D27768792
Pulled By: thejchap
fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3