Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105
The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.
```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()
Rollback txn
Flush()
```
There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.
```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```
The operation sequence 2 can result in an L0 with only the SD.
Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).
```
else if (clear_and_output_next_key_) {
assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```
To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.
Reviewed By: ltamasi
Differential Revision: D32056848
fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060
RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.
The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.
Reviewed By: ltamasi
Differential Revision: D31813016
fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:
```
earliest_snap < earliest_write_conflict_snap
```
If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.
Reviewed By: ltamasi
Differential Revision: D31813017
fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.
Fixes https://github.com/facebook/rocksdb/issues/9078
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D32009467
Pulled By: pdillinger
fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.
There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?
1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image
Not sure if there is more required for travis. Happy to help in any way I can.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962
Reviewed By: mrambacher
Differential Revision: D31802198
Pulled By: pdillinger
fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.
Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912
Test Plan: add new tests to lru_cache_test and pass make check.
Reviewed By: pdillinger
Differential Revision: D31452871
Pulled By: zhichao-cao
fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
Right now FaultInjectionTestFS::InjectThreadSpecificReadError() might try to corrupt return bytes, but these bytes might be from mmapped files, which would cause segfault. Instead FaultInjectionTestFS::InjectThreadSpecificReadError() should never corrupt data unless it is in caller's buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8952
Test Plan: See db_stress still runs and make sure in a test run failurs are still injected in non-mmap cases.
Reviewed By: ajkr, ltamasi
Differential Revision: D31147318
fbshipit-source-id: 9484a64ff2aaa36685557203f449286e694e65f9
Summary:
There is a corner case when using WriteUnprepared transactions when
`WriteUnpreparedTxn::Get` returns `Status::TryAgain` instead of
propagating the result of `GetFromBatchAndDB`. The patch adds
`PermitUncheckedError` to make the `ASSERT_STATUS_CHECKED` build pass in
this case as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8947
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D31125422
Pulled By: ltamasi
fbshipit-source-id: 42de51dcfa9384e032244c2b4d3f40e9a4111194
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937
Reviewed By: anand1976
Differential Revision: D31085930
fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.
This change is for inclusion in version 6.25.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31085029
Pulled By: pdillinger
fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b
Summary:
Right now, the failure injection test for MultiGet() is not sufficient. Improve it with TestFSRandomAccessFile::MultiRead() injecting failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8925
Test Plan: Run crash test locally for a while.
Reviewed By: anand1976
Differential Revision: D31000529
fbshipit-source-id: 439c7e02cf7440ac5af82deb609e202abdca3e1f
Summary:
In order to populate the IOStatus up to the higher level, replace some of the Status to IOStatus.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8820
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D30967215
Pulled By: zhichao-cao
fbshipit-source-id: ccf9d5cfbd9d3de047c464aaa85f9fa43b474903
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618
Reviewed By: pdillinger
Differential Revision: D30136064
Pulled By: mrambacher
fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.
rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)
Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.
This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893
Test Plan: existing tests and resolving issues detected by new check
Reviewed By: mrambacher
Differential Revision: D30823300
Pulled By: pdillinger
fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
Summary:
ManagedObjects are shared pointer objects where RocksDB wants to share a single object between multiple configurations. For example, the Cache may be shared between multiple column families/tables or the Statistics may be shared between multiple databases.
ManagedObjects are stored in the ObjectRegistry by Type (e.g. Cache) and ID. For a given type/ID name, a single object is stored.
APIs were added to get/set/create these objects.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8658
Reviewed By: pdillinger
Differential Revision: D30806273
Pulled By: mrambacher
fbshipit-source-id: 832ac4423b210c4c4b4a456b35897334775d3160
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722
Test Plan:
- Passed existing tests
- Passed added unit tests
Reviewed By: pdillinger
Differential Revision: D30610464
Pulled By: hx235
fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
Summary:
A "LATEST_BACKUP" file was left in the backup directory by
"BackupEngineTest.NoDeleteWithReadOnly" test, affecting future test
runs. In particular, it caused "BackupEngineTest.IOStats" to fail since
it relies on backup directory containing only data written by its
`BackupEngine`.
The fix is to promote "LATEST_BACKUP" to an explicitly managed file so
it is deleted in `BackupEngineTest` constructor if it exists.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8887
Test Plan:
below command used to fail. Now it passes:
```
$ TEST_TMPDIR=/dev/shm ./backupable_db_test --gtest_filter='BackupEngineTest.NoDeleteWithReadOnly:BackupEngineTest.IOStats'
```
Reviewed By: pdillinger
Differential Revision: D30812336
Pulled By: ajkr
fbshipit-source-id: 32dfbe1368ebdab872e610764bfea5daf9a2af09
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819
Reviewed By: pdillinger
Differential Revision: D30779238
Pulled By: ajkr
fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.
Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740
Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:
utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)
db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA
Reviewed By: mrambacher
Differential Revision: D30706246
Pulled By: pdillinger
fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
Summary:
Old typedef syntax is confusing
Most but not all changes with
perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
make format
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751
Test Plan: existing
Reviewed By: zhichao-cao
Differential Revision: D30745277
Pulled By: pdillinger
fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732
Test Plan: updated unit test fails with TSAN before change, passes after
Reviewed By: ajkr
Differential Revision: D30683879
Pulled By: pdillinger
fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D30563780
Pulled By: pdillinger
fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
In debug mode, we are seeing assertion failure as follows
```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```
It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.
In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.
Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.
Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```
It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D30145645
Pulled By: riversand963
fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652
Reviewed By: ajkr
Differential Revision: D30276872
Pulled By: autopear
fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.
Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).
Fixes https://github.com/facebook/rocksdb/issues/8500
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568
Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)
Reviewed By: akankshamahajan15
Differential Revision: D29812605
Pulled By: pdillinger
fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.
Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616
Reviewed By: zhichao-cao
Differential Revision: D30074422
Pulled By: anand1976
fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536
Reviewed By: mrambacher
Differential Revision: D29962120
Pulled By: zhichao-cao
fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565
Test Plan: make check
Reviewed By: lth
Differential Revision: D29963984
Pulled By: riversand963
fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.
2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.
`wal_dir`: If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.
`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.
A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572
Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```
Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN ] CheckpointTest.CheckpointWithOptionsDirsTest
[ OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[ PASSED ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.
Reviewed By: zhichao-cao
Differential Revision: D29832761
Pulled By: autopear
fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574
Test Plan: pass the failed test
Reviewed By: ajkr
Differential Revision: D29855096
Pulled By: jay-zhuang
fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
Summary:
If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D29601764
Pulled By: zhichao-cao
fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548
Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.
Reviewed By: jay-zhuang
Differential Revision: D29745095
fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489
Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501
Test Plan: Run stress test for a while and see it doesn't break.
Reviewed By: anand1976
Differential Revision: D29594384
fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475
Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly
Reviewed By: jay-zhuang
Differential Revision: D29504843
Pulled By: ajkr
fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
Summary:
```TestFSWritableFile``` buffers data in ```Append``` in order to simulate unsynced data loss on crash. This is only required for buffered IO and should be disabled for direct IO. Otherwise, it causes crash tests to assert on the buffer address alignment - ```db_stress: env/io_posix.cc:1194: virtual rocksdb::IOStatus rocksdb::PosixWritableFile::Append(const rocksdb::Slice&, const rocksdb::IOOptions&, rocksdb::IODebugContext*): Assertion `IsSectorAligned(data.data(), GetRequiredBufferAlignment())' failed.```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8490
Reviewed By: zhichao-cao
Differential Revision: D29565080
Pulled By: anand1976
fbshipit-source-id: 682831fd66ed3b9597caa74fc453e22dfaf9b973
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476
Test Plan: Some manual tests and make sure failures are triggered.
Reviewed By: anand1976
Differential Revision: D29507283
fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474
Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.
Reviewed By: zhichao-cao
Differential Revision: D29503116
fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary:
This test case has been failing occasionally due to automatic
compactions kicking in, resulting in GC generating additional
blob files that the test did not expect. Disabling automatic
compactions to get rid of this flakiness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8444
Test Plan: `gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="BlobDBTest.SnapshotAndGarbageCollection"`
Reviewed By: jay-zhuang
Differential Revision: D29316987
Pulled By: ltamasi
fbshipit-source-id: 9815d189ed7d63890622768675a01990e3680221
Summary:
This reverts commit 25be1ed66a.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438
Test Plan: Run the impacted mysql test 40 times
Reviewed By: ajkr
Differential Revision: D29286247
Pulled By: jay-zhuang
fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
Summary:
This commit is for enabling `DBWithTTL` to use `DeteleRange` which it cannot before.
As (int32_t)Timestamp is suffixed to values in `DBWithTTL`, there is no reason that it
cannot use the common used api. I added `DeleteRangeCF` in `DBWithTTLImpl::Write`
so that we can use `DeteleRange` normally. When we run code like
`dbWithTtl->DeleteRange(start, end)`, it executes`WriteBatchInternal::DeleteRange`
internally. Intended to fix https://github.com/facebook/rocksdb/issues/7218
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8384
Test Plan: added corresponded testing logic to existing unit test
Reviewed By: jay-zhuang
Differential Revision: D29176734
fbshipit-source-id: 6874ed979fc08e1d138149d03653e43a75f0e0e6
Summary:
This reverts commit 9167ece586.
It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410
Test Plan: `make check -j48`
Reviewed By: jay-zhuang
Differential Revision: D29150300
Pulled By: ajkr
fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.
This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.
The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:
- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357
Test Plan: - make check
Reviewed By: jay-zhuang
Differential Revision: D28855340
Pulled By: ajkr
fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned. This could happen if the value was retrieved from the DB (not memtable) or potentially if the values were reused (and a previous iteration returned a slice that was pinned).
This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299
Reviewed By: jay-zhuang
Differential Revision: D28455426
Pulled By: mrambacher
fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.
Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.
cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```
db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 80.702 micros/op 198104 ops/sec; 54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 87.124 micros/op 183625 ops/sec; 50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 77.653 micros/op 206025 ops/sec; 56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 84.962 micros/op 188299 ops/sec; 51.7 MB/s (3535999 of 3535999 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
Reviewed By: zhichao-cao
Differential Revision: D28357511
Pulled By: anand1976
fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
Summary:
We saw the `Commit()` fail with "Operation expired" so apparently the
expiration time is too short. Increased the magnitude of the times in
this test to make flakiness less likely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8258
Reviewed By: jay-zhuang
Differential Revision: D28177033
Pulled By: ajkr
fbshipit-source-id: 0357acee6cc14c104b6ccd39231a683a606ab130
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:
* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)
I have enhanced cache_bench to validate this approach:
* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.
To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.
The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.
Baseline typical output:
```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662
Operation latency (ns):
Count: 80000000 Average: 11223.9494 StdDev: 29.61
Min: 0 Median: 7759.3973 Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[ 0, 1 ] 68 0.000% 0.000%
( 2900, 4400 ] 89 0.000% 0.000%
( 4400, 6600 ] 33630240 42.038% 42.038% ########
( 6600, 9900 ] 18129842 22.662% 64.700% #####
( 9900, 14000 ] 7877533 9.847% 74.547% ##
( 14000, 22000 ] 15193238 18.992% 93.539% ####
( 22000, 33000 ] 3037061 3.796% 97.335% #
( 33000, 50000 ] 1626316 2.033% 99.368%
( 50000, 75000 ] 421532 0.527% 99.895%
( 75000, 110000 ] 56910 0.071% 99.966%
( 110000, 170000 ] 16134 0.020% 99.986%
( 170000, 250000 ] 5166 0.006% 99.993%
( 250000, 380000 ] 3017 0.004% 99.996%
( 380000, 570000 ] 1337 0.002% 99.998%
( 570000, 860000 ] 805 0.001% 99.999%
( 860000, 1200000 ] 319 0.000% 100.000%
( 1200000, 1900000 ] 231 0.000% 100.000%
( 1900000, 2900000 ] 100 0.000% 100.000%
( 2900000, 4300000 ] 39 0.000% 100.000%
( 4300000, 6500000 ] 16 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
```
New, gather_stats=false. Median thread ops/sec of 5 runs:
```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458
Operation latency (ns):
Count: 80000000 Average: 11298.1027 StdDev: 42.18
Min: 0 Median: 7722.0822 Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[ 0, 1 ] 109 0.000% 0.000%
( 2900, 4400 ] 793 0.001% 0.001%
( 4400, 6600 ] 34054563 42.568% 42.569% #########
( 6600, 9900 ] 17482646 21.853% 64.423% ####
( 9900, 14000 ] 7908180 9.885% 74.308% ##
( 14000, 22000 ] 15032072 18.790% 93.098% ####
( 22000, 33000 ] 3237834 4.047% 97.145% #
( 33000, 50000 ] 1736882 2.171% 99.316%
( 50000, 75000 ] 446851 0.559% 99.875%
( 75000, 110000 ] 68251 0.085% 99.960%
( 110000, 170000 ] 18592 0.023% 99.983%
( 170000, 250000 ] 7200 0.009% 99.992%
( 250000, 380000 ] 3334 0.004% 99.997%
( 380000, 570000 ] 1393 0.002% 99.998%
( 570000, 860000 ] 700 0.001% 99.999%
( 860000, 1200000 ] 293 0.000% 100.000%
( 1200000, 1900000 ] 196 0.000% 100.000%
( 1900000, 2900000 ] 69 0.000% 100.000%
( 2900000, 4300000 ] 32 0.000% 100.000%
( 4300000, 6500000 ] 10 0.000% 100.000%
```
New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551
Operation latency (ns):
Count: 80000000 Average: 11311.2629 StdDev: 45.28
Min: 0 Median: 7686.5458 Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[ 0, 1 ] 71 0.000% 0.000%
( 2900, 4400 ] 291 0.000% 0.000%
( 4400, 6600 ] 34492060 43.115% 43.116% #########
( 6600, 9900 ] 16727328 20.909% 64.025% ####
( 9900, 14000 ] 7845828 9.807% 73.832% ##
( 14000, 22000 ] 15510654 19.388% 93.220% ####
( 22000, 33000 ] 3216533 4.021% 97.241% #
( 33000, 50000 ] 1680859 2.101% 99.342%
( 50000, 75000 ] 439059 0.549% 99.891%
( 75000, 110000 ] 60540 0.076% 99.967%
( 110000, 170000 ] 14649 0.018% 99.985%
( 170000, 250000 ] 5242 0.007% 99.991%
( 250000, 380000 ] 3260 0.004% 99.995%
( 380000, 570000 ] 1599 0.002% 99.997%
( 570000, 860000 ] 1043 0.001% 99.999%
( 860000, 1200000 ] 471 0.001% 99.999%
( 1200000, 1900000 ] 275 0.000% 100.000%
( 1900000, 2900000 ] 143 0.000% 100.000%
( 2900000, 4300000 ] 60 0.000% 100.000%
( 4300000, 6500000 ] 27 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
( 9800000, 14000000 ] 1 0.000% 100.000%
Gather stats latency (us):
Count: 46 Average: 980387.5870 StdDev: 60911.18
Min: 879155 Median: 1033777.7778 Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
( 860000, 1200000 ] 45 97.826% 97.826% ####################
( 1200000, 1900000 ] 1 2.174% 100.000%
Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```
Reviewed By: mrambacher
Differential Revision: D28295742
Pulled By: pdillinger
fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.
Examples of issues that exist which are solved by this PR:
## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.
## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135
Reviewed By: pdillinger
Differential Revision: D27657938
Pulled By: mrambacher
fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.
This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.
Later changes will fix the internal implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274
Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.
Reviewed By: mrambacher
Differential Revision: D28259471
Pulled By: pdillinger
fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.
Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).
Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
Test Plan: ASAN make check
Reviewed By: mrambacher
Differential Revision: D28075891
Pulled By: pdillinger
fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235
Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.
Reviewed By: anand1976
Differential Revision: D28010944
fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208
Make include of "file_system.h" use the same include path as everywhere
else.
Reviewed By: riversand963, akankshamahajan15
Differential Revision: D27881606
fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary:
Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
`rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
which is not safe without synchronization. This resulted in data races with
memtable inserts. The patch changes the code handling these properties
to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
cached value backed by an atomic variable. Two test cases had to be updated
for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
increasing the value size used so each value ends up in its own memtable,
which was the original intention (note: the test has been broken in the sense
that the test code didn't consider that memtable sizes below 64 KB get
increased to 64 KB by `SanitizeOptions`, and has been passing only by
accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
directly instead of going through the DB properties. Note: this should be safe in this case
since there's only a single thread involved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D27866811
Pulled By: ltamasi
fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
Summary:
Test was flaky because for kUseDbSessionId naming, blob files use
naming scheme kLegacyCrc32cAndFileSize. So expected number of files
because of collision can vary. So disabling blobdb for this test case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197
Reviewed By: pdillinger
Differential Revision: D27836997
Pulled By: akankshamahajan15
fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014
- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172
Test Plan:
```bash
$ make check -j$(nproc)
Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```
Reviewed By: ajkr
Differential Revision: D27768792
Pulled By: thejchap
fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3
Summary:
* CreateNewBackup(WithMetadata) returning the BackupID of new backup
through optional new output param. This is especially useful with the
new mutithreading support, so that you can transactionally determine the
ID of a backup you create.
* GetBackupInfo / GetLatestBackupInfo for individual backups, so that
you don't have to comb through a vector of backups if you don't want to.
Updated HISTORY.md (including re: BlobDB support as new feature)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170
Test Plan:
Added test logic to existing tests, to minimize increase in
cost of running tests
Reviewed By: zhichao-cao
Differential Revision: D27680410
Pulled By: pdillinger
fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb
Summary:
Add support for blob files for backup/restore like table files.
Since DB session ID is currently not supported for blob files (there is no place to store it in
the header), so for blob files uses the
kLegacyCrc32cAndFileSize naming scheme even if
share_files_with_checksum_naming is set to kUseDbSessionId.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129
Test Plan: Add new test units
Reviewed By: ltamasi
Differential Revision: D27408510
Pulled By: akankshamahajan15
fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e
Summary:
Fixing another crash test failure in the case of
write_dbid_to_manifest=true and reading a backup as read-only DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164
Test Plan:
enhanced unit test for backup as read-only DB, ran
blackbox_crash_test more with elevated backup_one_in
Reviewed By: zhichao-cao
Differential Revision: D27622237
Pulled By: pdillinger
fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb
Summary:
Forgot to re-test crash test after adding read-only filesystem
enforcement to https://github.com/facebook/rocksdb/issues/8142. The problem is ReadOnlyFileSystem would reject
CreateDirIfMissing whenever DBOptions::create_if_missing=true. The fix
that is better for users is to allow CreateDirIfMissing in
ReadOnlyFileSystem if the directory exists, so that they don't cause a
failure on using create_if_missing with opening backups as read-only
DBs. Added this option test to the unit test (in addition to being in the
crash test).
Also fixed a couple of lints.
And some better messaging from 'make format' so that when you run it
with uncommitted changes, it's clear that it's only checking the
uncommitted changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8161
Test Plan: local blackbox_crash_test with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27614409
Pulled By: pdillinger
fbshipit-source-id: 63ccb626c7e34c200d61c6bca2a8f60da9015179
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.
Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.
Possible follow-up work:
* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.
Implementation details:
Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.
To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.
To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.
Additional minor refactoring in backupable_db.cc to support the new
functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142
Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27535408
Pulled By: pdillinger
fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
Summary:
Before corrupting a file in the DB and expecting corruption to
be detected, open DB read-only to ensure file is not made obsolete by
compaction. Also, to avoid obsolete files not yet deleted, only select
live files to corrupt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8151
Test Plan: watch CI
Reviewed By: akankshamahajan15
Differential Revision: D27568849
Pulled By: pdillinger
fbshipit-source-id: 39a69a2eafde0482b20a197949d24abe21952f27
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.
The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.
To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.
Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)
Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.
Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115
Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.
Reviewed By: ajkr
Differential Revision: D27347589
Pulled By: pdillinger
fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
Summary:
The implementation of TransactionDB::WrapDB() and
TransactionDB::WrapStackableDB() are almost identical, except for the
type of the first argument `db`. This PR adds a new template function in
anonymous namespace, and calls it in the above two functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8079
Test Plan: make check
Reviewed By: lth
Differential Revision: D27184575
Pulled By: riversand963
fbshipit-source-id: f2855a6db3a7e897d0d611f7050ca4b696c56a7a
Summary:
This does not add any new public APIs or published
functionality, but adds the ability to read and use (and in tests,
write) backups with a new meta file schema, based on the old schema
but not forward-compatible (before this change). The new schema enables
some capabilities not in the old:
* Explicit versioning, so that users get clean error messages the next
time we want to break forward compatibility.
* Ignoring unrecognized fields (with warning), so that new non-critical
features can be added without breaking forward compatibility.
* Rejecting future "non-ignorable" fields, so that new features critical
to some use-cases could potentially be added outside of linear schema
versions, with broken forward compatibility.
* Fields at the end of the meta file, such as for checksum of the meta
file's contents (up to that point)
* New optional 'size' field for each file, which is checked when present
* Optionally omitting 'crc32' field, so that we aren't required to have
a crc32c checksum for files to take a backup. (E.g. to support backup
via hard links and to better support file custom checksums.)
Because we do not have a JSON parser and to share code, the new schema
is simply derived from the old schema.
BackupEngine code is updated to allow missing checksums in some places,
and to make that easier, `has_checksum` and `verify_checksum_after_work`
are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm
not too afraid of regressing on data integrity, because
(a) we have pretty good test coverage of corruption detection in backups, and
(b) we are increasingly relying on the DB itself for data integrity rather than
it being an exclusive feature of backups.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069
Test Plan:
new unit tests, added to crash test (some local run with
boosted backup probability)
Reviewed By: ajkr
Differential Revision: D27139824
Pulled By: pdillinger
fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
These classes were wraps of Env that provided only extensions to the FileSystem functionality. Changed the classes to be FileSystems and the wraps to be of the CompositeEnvWrapper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968
Reviewed By: anand1976
Differential Revision: D26900253
Pulled By: mrambacher
fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042
Test Plan:
validation of the API added to many existing backup unit
tests
Reviewed By: mrambacher
Differential Revision: D26936577
Pulled By: pdillinger
fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a