Summary:
In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516
Test Plan: pass make asan_check
Differential Revision: D20423922
Pulled By: zhichao-cao
fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a
Summary:
In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523
Test Plan: watch test for LITE mode
Differential Revision: D20420321
Pulled By: cheng-chang
fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used.
Since type of lockedFiles is changed, rename it to follow naming convention too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507
Test Plan: Add a unit test and improve an existing test to validate the case.
Differential Revision: D20378333
fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a
Summary:
It's never too soon to refactor something. The patch splits the recently
introduced (`VersionEdit` related) `BlobFileState` into two classes
`BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files
are closed, they are immutable, and the only thing that changes is the
amount of garbage in them. In the new design, `BlobFileAddition` contains
the immutable attributes (currently, the count and total size of all blobs, checksum
method, and checksum value), while `BlobFileGarbage` contains the mutable
GC-related information elements (count and total size of garbage blobs). This is a
better fit for the GC logic and is more consistent with how SST files are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502
Test Plan: `make check`
Differential Revision: D20348352
Pulled By: ltamasi
fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503
Test Plan: make check
Differential Revision: D20353758
Pulled By: riversand963
fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
`TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support
of underlying file system.
On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do;
So a check block is added to detect fallocate support and skip test if not.
The related work is done by JunHe77. Thanks!
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437
Differential Revision: D20145032
Pulled By: pdillinger
fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482
Test Plan:
A new test ManualCompactionTest::SkipLevel is added.
Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.
Also updated db_compaction_test and db_test to use correct range for full compaction.
Differential Revision: D20251149
Pulled By: cheng-chang
fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.
A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414
Test Plan: pass make asan_check, pass error_handler_fs_test.
Differential Revision: D20252421
Pulled By: zhichao-cao
fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Summary:
this silences following warning from clang-11
```
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const
std::pair<int\
, rocksdb::FileMetaData>' [-Wrange-loop-analysis]
for (const auto newf : c->edit()->GetNewFiles()) {
^
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying
for (const auto newf : c->edit()->GetNewFiles()) {
^~~~~~~~~~~~~~~~~
&
```
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477
Differential Revision: D20211850
Pulled By: ltamasi
fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01
Summary:
When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run:
==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308
READ of size 8 at 0x612000b7d198 thread T845 (store_load-33)
SCARINESS: 51 (8-byte-read-heap-use-after-free)
#0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/version_set.cc:1488
https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/db_impl/db_impl.cc:4499
https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/utilities/stackable_db.h:392
......
0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8)
freed by thread T28 here:
......
https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435
https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734
https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758
https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869
https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275
https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction*) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78
https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction*) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376
https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826
https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320
https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096
......
Fix the issue by reference the super version and use the referenced version from it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473
Test Plan: Run ASAN for all existing tests.
Differential Revision: D20196416
fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468
Test Plan: pass make asan_check
Differential Revision: D20195261
Pulled By: zhichao-cao
fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.
ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 72ee067b9):
101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found)
This PR:
100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found)
./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409
Differential Revision: D20200086
Pulled By: riversand963
fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
Summary:
Original author: jeffrey-xiao
If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.
This commit removes global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of storing this mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
Differential Revision: D19961563
Pulled By: ajkr
fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
Summary:
BlobDB currently does not keep track of blob files: no records are written to
the manifest when a blob file is added or removed, and upon opening a database,
the list of blob files is populated simply based on the contents of the blob directory.
This means that lost blob files cannot be detected at the moment. We plan to solve
this issue by making blob files a part of `Version`; as a first step, this patch makes
it possible to store information about blob files in `VersionEdit`. Currently, this information
includes blob file number, total number and size of all blobs, and total number and size
of garbage blobs. However, the format is extensible: new fields can be added in
both a forward compatible and a forward incompatible manner if needed (similarly
to `kNewFile4`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416
Test Plan: `make check`
Differential Revision: D19894234
Pulled By: ltamasi
fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329
Summary:
Cleanup some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440
Differential Revision: D20015891
Pulled By: riversand963
fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
Summary:
Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442
Differential Revision: D20036813
Pulled By: riversand963
fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.
Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426
Test Plan: make check
Differential Revision: D19951866
Pulled By: riversand963
fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
We were removing the file from `log_recycle_files_` before renaming it
with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
the DB mutex, it was possible for a concurrent full purge to sneak in
and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
would fail and the DB would enter read-only mode.
The fix is to hold the old file number in `log_recycle_files_` until
after the file has been renamed. Full purge uses that list to decide
which files to keep, so it can no longer delete a file pending recycling.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900
Test Plan: new unit test
Differential Revision: D19771719
Pulled By: ajkr
fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
Summary:
Previously, when recovering version set, LoadTableHandlers failures are ignored.
If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368
Test Plan: make check
Differential Revision: D19713459
Pulled By: cheng-chang
fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55
Summary:
During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`.
To fix, reset unprepared_batch_ to false after `MarkEndPrepare`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419
Differential Revision: D19896148
Pulled By: lth
fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3
Summary:
A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417
Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix.
Differential Revision: D19894537
fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e
Summary:
It's observed on Windows DestroyDB failed to remove the log file because the logger is still alive in sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempt to clear the database directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308
Differential Revision: D19818829
Pulled By: riversand963
fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a
Summary:
Seems like this caused the following test failure on AppVeyor:
DBTest2.CrashInRecoveryMultipleCF
c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options)
IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410
Test Plan: Wait to see whether the AppVeyor test passes.
Differential Revision: D19879872
Pulled By: cheng-chang
fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403
Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure
Differential Revision: D19846721
Pulled By: anand1976
fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401
Differential Revision: D19826623
Pulled By: anand1976
fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
Summary:
Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope.
Used in VersionSet:: ProcessManifestWrites as an example.
The inline comments for class `Defer` have more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382
Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test`
Differential Revision: D19797538
Pulled By: cheng-chang
fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f
Summary:
https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with
`VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit`
(which was added in https://github.com/facebook/rocksdb/pull/5411):
`AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`,
however, the client `VersionSet` classes keep using it afterwards. This
*seemed to* work before the refactoring but it really did not: since
`VersionEdit` used to have a user-declared destructor, no move
constructor/move assignment operator was generated, and the `move` in
`AddEdit` was really a copy. The patch makes the copy explicit. Note: it
should be possible to rework this logic so that we can get away
with the move but for now, this should fix the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400
Test Plan:
`make check`
`make analyze`
Differential Revision: D19824466
Pulled By: ltamasi
fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396
Differential Revision: D19808978
Pulled By: cheng-chang
fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387
Test Plan: Add unit tests
Differential Revision: D19799819
Pulled By: anand1976
fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
Summary:
Fixed an error when compiled with -Og:
db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Signed-off-by: Robert Yang <liezhi.yang@windriver.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275
Differential Revision: D19381755
fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c
Summary:
This is a bunch of small improvements to `VersionEdit`. Namely, the patch
* Makes the names and order of variables, methods, and code chunks related
to the various information elements more consistent, and adds missing
getters for the sake of completeness.
* Initializes previously uninitialized stack variables.
* Marks all getters const to improve const correctness.
* Adds in-class initializers and removes the default ctor that would
create an object with uninitialized built-in fields and call `Clear`
afterwards.
* Adds a new type alias for new files and changes the existing `typedef`
for deleted files into a type alias as well.
* Makes the helper method `DecodeNewFile4From` private.
* Switches from long-winded iterator syntax to range based loops in a
couple of places.
* Fixes a couple of assignments where an integer 0 was assigned to
boolean members.
* Fixes a getter which used to return a `const std::string` instead of
the intended `const std::string&`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383
Test Plan: make check
Differential Revision: D19780537
Pulled By: ltamasi
fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0
Summary:
Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379
Test Plan: not needed
Differential Revision: D19795440
Pulled By: cheng-chang
fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b
Summary:
Several statuses were not checked during DB::Open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380
Test Plan: make check
Differential Revision: D19780237
Pulled By: cheng-chang
fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9
Summary:
Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case.
If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already.
Test plan (dev server):
```
./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults
make check
```
Both must succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385
Differential Revision: D19782943
Pulled By: riversand963
fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121
Summary:
In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files.
The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved.
This PR updates the test to make sure Get(Key(k)) returns the original value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370
Test Plan: make db_sst_test && ./db_sst_test
Differential Revision: D19726278
Pulled By: cheng-chang
fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7
Summary:
Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363
Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s.
Differential Revision: D19702509
Pulled By: al13n321
fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73
Summary:
Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369
Test Plan: Run all existing tests
Differential Revision: D19714872
fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
Fix an intermittent failure in
DBErrorHandlingTest.CompactionManifestWriteError due to a race between
background error recovery and the main test thread calling
TEST_WaitForCompact().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6367
Test Plan: Run the test using gtest_parallel
Differential Revision: D19713802
Pulled By: anand1976
fbshipit-source-id: 29e35dc26e0984fe8334c083e059f4fa1f335d68
Summary:
Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365
Test Plan: RUn all existing tests.
Differential Revision: D19709399
fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe
Summary:
A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358
Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families.
Differential Revision: D19675141
fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be
Summary:
MultiDBCompactionError fails when it verifies the number of files on level 0 and level 1 without waiting for compaction to finish.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6266
Differential Revision: D19701639
Pulled By: riversand963
fbshipit-source-id: e96d511bcde705075f073e0b550cebcd2ecfccdc
Summary:
DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356
Test Plan: Run the test with both of LITE and non-LITE build.
Differential Revision: D19669537
fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e
Summary:
Add a unit test for prefix extractor change, including a check that fails due to a bug.
Also comment out the partitioned filter case which will fail the test too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323
Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented)
Differential Revision: D19509744
fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351
Differential Revision: D19648931
Pulled By: maysamyabandeh
fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
Summary:
Unit test names, together with other components, are used to create log files
during some internal testing. Overly long names cause infra failure due to file
names being too long.
Look for internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352
Differential Revision: D19649307
Pulled By: riversand963
fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316
When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flush and compactions, as well as foreground operatons such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.
To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331
Test Plan: Add new unit tests in error_handler_test.cc
Differential Revision: D19632951
Pulled By: anand1976
fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
Summary:
DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there.
Fix it by disabling the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346
Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode
Differential Revision: D19627486
fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08
Summary:
db_bloom_filter_test break with clang LITE build with following message:
db/db_bloom_filter_test.cc:23:29: error: unused variable 'kPlainTable' [-Werror,-Wunused-const-variable]
static constexpr PseudoMode kPlainTable = -1;
^
Fix it by moving the declaration out of LITE build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6340
Test Plan:
USE_CLANG=1 LITE=1 make db_bloom_filter_test
and without LITE=1
Differential Revision: D19609834
fbshipit-source-id: 0e88f5c6759238a94f9880d84c785ac18e7cdd7e
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313
Differential Revision: D19458291
Pulled By: maysamyabandeh
fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
Differential Revision: D19458717
fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
Summary:
./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times.
```
With the fix: all runs (tried 100, 1000, 10000) succeed.
```
$ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000
[1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms)
```
Test Plan:
Build:
```
COMPILE_WITH_TSAN=1 make db_compaction_test -j100
```
Without the fix: a few runs out of 100 fail:
```
$ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100
...
...
Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN ] DBCompactionTest.LevelTtlCascadingCompactions
db/db_compaction_test.cc:3687: Failure
Expected equality of these values:
oldest_time
Which is: 1580155869
level_to_files[6][0].oldest_ancester_time
Which is: 1580155870
DB is still at /dev/shm//db_compaction_test_6337001442947696266
[ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms)
[----------] 1 test from DBCompactionTest (1432 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1433 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions
1 FAILED TEST
[80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms)
[100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms)
FAILED TESTS (4/100):
1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/90)
1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/84)
1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/82)
1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/74)
Differential Revision: D19587040
Pulled By: sagar0
fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950
Summary:
It chooses the oldest memtable, not the largest one. This is an
important difference for users whose CFs receive non-uniform write
rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335
Differential Revision: D19588865
Pulled By: maysamyabandeh
fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked
Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328
Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.
Differential Revision: D19586751
fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
Summary:
The earlier code used two conflicting definitions for the number of
input records going into a compaction, one based on the
`rocksdb.num.entries` table property and one based on
`CompactionIterationStats`. The first one is correct and in line
with how output records are counted, while the second one incorrectly
ignores input records in various cases when the `CompactionIterator`
advances or reseeks the input iterator (this can happen, amongst other
cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s,
and compaction filters). This can result in the code undercounting the
input records and computing an incorrect value for "records dropped"
during the compaction. The patch fixes this by switching over to the
correct (table property based) input record count for "records dropped".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325
Test Plan: Tested using `make check` and `db_bench`.
Differential Revision: D19525491
Pulled By: ltamasi
fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636
Summary:
When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322
Test Plan: Add a unit test in db_write_test
Differential Revision: D19538313
Pulled By: anand1976
fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a
Summary:
This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else.
The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue.
But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321
Differential Revision: D19506585
Pulled By: ltamasi
fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a
Summary:
Currently, this test case tries to infer whether
`VersionStorageInfo::UpdateAccumulatedStats` was called during open by
checking the number of files opened against an arbitrary threshold (10).
This makes the test brittle and results in sporadic failures. The patch
changes the test case to use sync points to directly test whether
`UpdateAccumulatedStats` was called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306
Test Plan: `make check`
Differential Revision: D19439544
Pulled By: ltamasi
fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f
Summary:
When we do concurrently writes, and different write operations will have WAL enable or disable.
But the data from write operation with WAL disabled will still be logged into log files, which will lead to extra disk write/sync since we do not want any guarantee for these part of data.
Detail can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoid mixing the two types in a write group. The advantage is simpler reasoning about the write group content
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290
Differential Revision: D19448598
Pulled By: maysamyabandeh
fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de
Summary:
This PR adds a `rocksdb_options_set_atomic_flush` function to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307
Differential Revision: D19451313
Pulled By: ltamasi
fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c
Summary:
Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305
Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0
Differential Revision: D19438925
fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
Summary:
https://github.com/facebook/rocksdb/pull/2205 introduced a new
configuration option called `max_background_jobs`, superseding the
earlier options `max_background_flushes` and
`max_background_compactions`. However, unlike
`max_background_compactions`, setting `max_background_jobs` dynamically
through the `SetDBOptions` interface does not adjust the size of the
thread pools (see https://github.com/facebook/rocksdb/issues/6298). The
patch fixes this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300
Test Plan: Extended unit test.
Differential Revision: D19430899
Pulled By: ltamasi
fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074
Summary:
Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302
Test Plan:
Add a unit test that would fail. Not fix.
crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure.
Differential Revision: D19424572
fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
Differential Revision: D19404695
fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
Summary:
A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295
Test Plan: Run the test in LITE build and see it passes.
Differential Revision: D19395678
fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033
Summary:
The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key.
The result would be that it would hit an assertion in debug mode and lower level files might be skipped.
This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked.
The fix is to fix the fractional cascading index.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285
Test Plan: Add a unit test which would cause the assertion which would be fixed.
Differential Revision: D19358426
fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e
Summary:
This makes it easier to call the functions from Rust as otherwise they require mutable types.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6283
Differential Revision: D19349991
Pulled By: wqfish
fbshipit-source-id: e8da7a75efe8cd97757baef8ca844a054f2519b4
Summary:
Look at all compaction input files to compute the oldest ancestor time.
In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue.
Some TTL compactions when using Level-Style compactions might not have run due to this bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279
Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs.
Differential Revision: D19337812
Pulled By: sagar0
fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4
Summary:
unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions.
The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284
Differential Revision: D19356115
Pulled By: maysamyabandeh
fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6
Summary:
Fix compilation under LITE by putting `#ifndef ROCKSDB_LITE` around a code block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6277
Differential Revision: D19334157
Pulled By: riversand963
fbshipit-source-id: 947111ed68aa550f5ea424b216c1442a8af9e32b
Summary:
Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest
fails in LITE mode since SetOptions() returns "Not supported". I do not want to
put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just
trigger compaction on a different column family. The bg compaction thread
calling LogAndApply() may race with thread calling SwitchMemtable().
Test Plan (dev server):
make check
OPT=-DROCKSDB_LITE make check
or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267
Differential Revision: D19301309
Pulled By: riversand963
fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50
Summary:
Fix an error message when CURRENT is not found.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264
Differential Revision: D19300699
Pulled By: riversand963
fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf
Summary:
Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228
Differential Revision: D19264145
Pulled By: maysamyabandeh
fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010
Summary:
It seems that the C-API doesn't expose the range delete functionality at the moment, so add the API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6259
Differential Revision: D19290320
Pulled By: pdillinger
fbshipit-source-id: 3f403a4c3446d2042d55f1ece7cdc9c040f40c27
Summary:
When measure_io_stats_ is enabled, the volume of logging is beyond the default limit of 512 size. The patch allows the EventLoggerStream to change the limit, and also sets it to 1024 for FlushJob.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6258
Differential Revision: D19279269
Pulled By: maysamyabandeh
fbshipit-source-id: 3fb5d468dad488f289ac99d713378177eb7504d6
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254
Differential Revision: D19265819
Pulled By: maysamyabandeh
fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
Summary:
level passed into ColumnFamilyData::CalculateSSTWriteHint() can be smaller than base_level in current version, which would cause overflow.
We see ubsan complains:
db/compaction/compaction_job.cc:1511:39: runtime error: load of value 4294967295, which is not a valid value for type 'Env::WriteLifeTimeHint'
and I hope this commit fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6212
Test Plan: Run existing tests and see them to pass.
Differential Revision: D19168442
fbshipit-source-id: bf8fd86f85478ecfa7556db46dc3242de8c83dc9
Summary:
The bad code was:
```
mutex.Lock(); // `mutex` protects `container`
for (auto& x : container) {
mutex.Unlock();
// do stuff to x
mutex.Lock();
}
```
It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex.
Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`.
(RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ )
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193
Test Plan: Ran some logdevice integration tests that were crashing without this fix.
Differential Revision: D19116874
Pulled By: al13n321
fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703
Summary:
Cuts about 30-60 seconds to from each Travis Linux build, and about 15 minutes from each macOS build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181
Differential Revision: D19098357
Pulled By: pdillinger
fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
Then I found "delete sv" in ~SuperVersion() takes the time.
The backtrace looks like this
DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
I think it's better to delete in a background thread, please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
Differential Revision: D18972066
fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
Summary:
With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue`
result in `CompactionIterator::status_` being set to `Corruption` or `IOError`
as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in
`CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes.
Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the
compaction sooner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170
Test Plan:
Ran `make check` and used `db_bench` in BlobDB mode.
fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27
Summary:
As title. Previous assumption was that the underlying lib can always return
a shared_ptr<Env>. This is too strong. Therefore, we use Env::LoadEnv to relax
it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196
Test Plan: make check
Differential Revision: D19133199
Pulled By: riversand963
fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75
Summary:
In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible.
Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space.
Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089
Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check
Differential Revision: D18734668
Pulled By: zhichao-cao
fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
Summary:
https://github.com/facebook/rocksdb/pull/6177 introduced a data race
involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`.
The patch fixes this by caching whether the current version has any
memtable history (i.e. flushed memtables that are kept around for
transaction conflict checking) in an `std::atomic<bool>` member called
`current_has_history_`, similarly to how `current_memory_usage_excluding_last_`
is handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187
Test Plan:
```
make clean
COMPILE_WITH_TSAN=1 make db_test -j24
./db_test
```
Differential Revision: D19084059
Pulled By: ltamasi
fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. Part of the issue
is that trimming can potentially be scheduled even if there is no memtable
history. The patch adds a check that fixes this.
See also https://github.com/facebook/rocksdb/pull/6169.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177
Test Plan:
Compared `perf` output for
```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```
before and after the change. There is a significant reduction for the call chain
`rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` ->
`rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169.
Differential Revision: D19057445
Pulled By: ltamasi
fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
And clean up related code, especially in stress test.
(More clean up of db_stress_test_base.cc coming after this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154
Test Plan: make check, make blackbox_crash_test for a bit
Differential Revision: D18938180
Pulled By: pdillinger
fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e