Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615
Test Plan: Add a unit test to make sure that SetOptions() can change the options.
Reviewed By: riversand963
Differential Revision: D20755951
fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612
Test Plan: make check
Reviewed By: siying
Differential Revision: D20748217
Pulled By: riversand963
fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
Test Plan: tested with make asan_check
Reviewed By: riversand963
Differential Revision: D20717670
Pulled By: zhichao-cao
fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.
In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D20656803
Pulled By: ltamasi
fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
Summary:
As the first step of reintroducing eviction statistics for the block
cache, the patch switches from using simple function pointers as deleters
to function objects implementing an interface. This will enable using
deleters that have state, like a smart pointer to the statistics object
that is to be updated when an entry is removed from the cache. For now,
the patch adds a deleter template class `SimpleDeleter`, which simply
casts the `value` pointer to its original type and calls `delete` or
`delete[]` on it as appropriate. Note: to prevent object lifecycle
issues, deleters must outlive the cache entries referring to them;
`SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
pattern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545
Test Plan: `make asan_check`
Reviewed By: siying
Differential Revision: D20475823
Pulled By: ltamasi
fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
Summary:
In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593
Test Plan: Add a unit test. See all existing tests pass.
Reviewed By: ajkr
Differential Revision: D20637408
fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.
MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)
.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483
Reviewed By: anand1976
Differential Revision: D20498373
Pulled By: riversand963
fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
Summary:
When applying a new version in non DB open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the indented value in the first place, but it value 1 makes sense here, which would create no extra threads. This bug is not expected to cause user visible problems, assuming C++ implicitly cast bool to 0 or 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576
Test Plan: Run all exsiting test.
Reviewed By: ajkr
Differential Revision: D20602467
fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
has some stray calls to Env APIs rather than FileSystem, they will go
through the same object and thus there is no risk of getting out of
sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```
Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
Reviewed By: riversand963
Differential Revision: D20592038
Pulled By: anand1976
fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
Summary:
The MultiGet test in db_basic_test fails in CircleCI vs2019. The reason is that even Snappy compression is enabled, the first compression type is still kNoCompression. This PR checks the list and ensure that only when compression is enable and the compression type is valid, compression will be enabled. Such that, it will not fail the combined read test in MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6578
Test Plan: make check, db_basic_test.
Reviewed By: anand1976
Differential Revision: D20607529
Pulled By: zhichao-cao
fbshipit-source-id: dcead264d5c2da105912c18caad34b8510bb04b0
Summary:
Fix LITE build by excluding some unit tests that use features not supported in LITE.
```
db/db_basic_test.cc:1778:8: error: ‘void rocksdb::{anonymous}::TableFileListener::OnTableFileCreated(const rocksdb::TableFileCreationInfo&)’ marked ‘override’, but does not override
void OnTableFileCreated(const TableFileCreationInfo& info) override {
^~~~~~~~~~~~~~~~~~
make: *** [db/db_basic_test.o] Error 1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6575
Reviewed By: ltamasi
Differential Revision: D20598598
Pulled By: riversand963
fbshipit-source-id: 367f7cb2500360ad57030b138a94c0f731a04339
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
Summary:
When `use_direct_reads` and `use_direct_writes` are `false`, `logical_sector_size_` inside various `*File` implementations are not actually used, so `GetLogicalBlockSize` does not necessarily need to be called for `logical_sector_size_`, just set a default page size.
This is a follow up PR for https://github.com/facebook/rocksdb/pull/6457.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6522
Test Plan: make check
Reviewed By: siying
Differential Revision: D20408885
Pulled By: cheng-chang
fbshipit-source-id: f2d3808f41265237e7fa2c0be9f084f8fa97fe3d
Summary:
Each time RocksDB switches to a new MANIFEST file from old one, it calls WriteCurrentStateToManifest() which writes a 'snapshot' of the current in-memory state of versions to the beginning of the new manifest as a bunch of version edits. We can distinguish these version edits from other version edits written during normal operations with a custom, safe-to-ignore tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6530
Test Plan: added test to version_edit_test, pass make asan_check
Reviewed By: riversand963
Differential Revision: D20524516
Pulled By: zhichao-cao
fbshipit-source-id: f1de102f5499bfa88dae3caa2f32c7f42cf904db
Summary:
The whole point of the pimpl idiom is to hide implementation details.
Internal helper methods like `CheckConsistency`, `CheckConsistencyForDeletes`,
and `MaybeAddFile` do not belong in the public interface of the class.
In addition, the patch switches to `unique_ptr` for the implementation
object instead of using a raw `delete`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6556
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D20523568
Pulled By: ltamasi
fbshipit-source-id: 5bbb0ccebd0c47a33b815398c7f9cfe13bd775ac
Summary:
Reduce runtime by reducing test scale to avoid test time-outs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6546
Test Plan:
time ./db_with_timestamp_basic_test
and watch internal tests.
Reviewed By: zhichao-cao
Differential Revision: D20479292
Pulled By: riversand963
fbshipit-source-id: c9e4a155be7699dd4de60fa531de86d442a3ba0a
Summary:
If DB is not deleted, in concurrent test, the tests might fail because of the previously existing DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6532
Test Plan:
make clean && make -j24 LITE=1 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test
make clean && make -j24 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test
Differential Revision: D20454734
Pulled By: cheng-chang
fbshipit-source-id: 8abede2ec1d79c1a4fe1bc95fbda489f8f7ee052
Summary:
In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516
Test Plan: pass make asan_check
Differential Revision: D20423922
Pulled By: zhichao-cao
fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a
Summary:
In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523
Test Plan: watch test for LITE mode
Differential Revision: D20420321
Pulled By: cheng-chang
fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used.
Since type of lockedFiles is changed, rename it to follow naming convention too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507
Test Plan: Add a unit test and improve an existing test to validate the case.
Differential Revision: D20378333
fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a
Summary:
It's never too soon to refactor something. The patch splits the recently
introduced (`VersionEdit` related) `BlobFileState` into two classes
`BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files
are closed, they are immutable, and the only thing that changes is the
amount of garbage in them. In the new design, `BlobFileAddition` contains
the immutable attributes (currently, the count and total size of all blobs, checksum
method, and checksum value), while `BlobFileGarbage` contains the mutable
GC-related information elements (count and total size of garbage blobs). This is a
better fit for the GC logic and is more consistent with how SST files are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502
Test Plan: `make check`
Differential Revision: D20348352
Pulled By: ltamasi
fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503
Test Plan: make check
Differential Revision: D20353758
Pulled By: riversand963
fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
`TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support
of underlying file system.
On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do;
So a check block is added to detect fallocate support and skip test if not.
The related work is done by JunHe77. Thanks!
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437
Differential Revision: D20145032
Pulled By: pdillinger
fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482
Test Plan:
A new test ManualCompactionTest::SkipLevel is added.
Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.
Also updated db_compaction_test and db_test to use correct range for full compaction.
Differential Revision: D20251149
Pulled By: cheng-chang
fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.
A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414
Test Plan: pass make asan_check, pass error_handler_fs_test.
Differential Revision: D20252421
Pulled By: zhichao-cao
fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Summary:
this silences following warning from clang-11
```
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const
std::pair<int\
, rocksdb::FileMetaData>' [-Wrange-loop-analysis]
for (const auto newf : c->edit()->GetNewFiles()) {
^
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying
for (const auto newf : c->edit()->GetNewFiles()) {
^~~~~~~~~~~~~~~~~
&
```
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477
Differential Revision: D20211850
Pulled By: ltamasi
fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01
Summary:
When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run:
==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308
READ of size 8 at 0x612000b7d198 thread T845 (store_load-33)
SCARINESS: 51 (8-byte-read-heap-use-after-free)
#0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/version_set.cc:1488
https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/db_impl/db_impl.cc:4499
https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/utilities/stackable_db.h:392
......
0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8)
freed by thread T28 here:
......
https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435
https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734
https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758
https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869
https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275
https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction*) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78
https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction*) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376
https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826
https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320
https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096
......
Fix the issue by reference the super version and use the referenced version from it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473
Test Plan: Run ASAN for all existing tests.
Differential Revision: D20196416
fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468
Test Plan: pass make asan_check
Differential Revision: D20195261
Pulled By: zhichao-cao
fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.
ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 72ee067b9):
101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found)
This PR:
100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found)
./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409
Differential Revision: D20200086
Pulled By: riversand963
fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
Summary:
Original author: jeffrey-xiao
If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.
This commit removes global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of storing this mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
Differential Revision: D19961563
Pulled By: ajkr
fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
Summary:
BlobDB currently does not keep track of blob files: no records are written to
the manifest when a blob file is added or removed, and upon opening a database,
the list of blob files is populated simply based on the contents of the blob directory.
This means that lost blob files cannot be detected at the moment. We plan to solve
this issue by making blob files a part of `Version`; as a first step, this patch makes
it possible to store information about blob files in `VersionEdit`. Currently, this information
includes blob file number, total number and size of all blobs, and total number and size
of garbage blobs. However, the format is extensible: new fields can be added in
both a forward compatible and a forward incompatible manner if needed (similarly
to `kNewFile4`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416
Test Plan: `make check`
Differential Revision: D19894234
Pulled By: ltamasi
fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329
Summary:
Cleanup some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440
Differential Revision: D20015891
Pulled By: riversand963
fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
Summary:
Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442
Differential Revision: D20036813
Pulled By: riversand963
fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.
Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426
Test Plan: make check
Differential Revision: D19951866
Pulled By: riversand963
fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
We were removing the file from `log_recycle_files_` before renaming it
with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
the DB mutex, it was possible for a concurrent full purge to sneak in
and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
would fail and the DB would enter read-only mode.
The fix is to hold the old file number in `log_recycle_files_` until
after the file has been renamed. Full purge uses that list to decide
which files to keep, so it can no longer delete a file pending recycling.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900
Test Plan: new unit test
Differential Revision: D19771719
Pulled By: ajkr
fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
Summary:
Previously, when recovering version set, LoadTableHandlers failures are ignored.
If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368
Test Plan: make check
Differential Revision: D19713459
Pulled By: cheng-chang
fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55
Summary:
During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`.
To fix, reset unprepared_batch_ to false after `MarkEndPrepare`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419
Differential Revision: D19896148
Pulled By: lth
fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3
Summary:
A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417
Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix.
Differential Revision: D19894537
fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e
Summary:
It's observed on Windows DestroyDB failed to remove the log file because the logger is still alive in sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempt to clear the database directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308
Differential Revision: D19818829
Pulled By: riversand963
fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a
Summary:
Seems like this caused the following test failure on AppVeyor:
DBTest2.CrashInRecoveryMultipleCF
c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options)
IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410
Test Plan: Wait to see whether the AppVeyor test passes.
Differential Revision: D19879872
Pulled By: cheng-chang
fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403
Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure
Differential Revision: D19846721
Pulled By: anand1976
fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401
Differential Revision: D19826623
Pulled By: anand1976
fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
Summary:
Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope.
Used in VersionSet:: ProcessManifestWrites as an example.
The inline comments for class `Defer` have more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382
Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test`
Differential Revision: D19797538
Pulled By: cheng-chang
fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f
Summary:
https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with
`VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit`
(which was added in https://github.com/facebook/rocksdb/pull/5411):
`AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`,
however, the client `VersionSet` classes keep using it afterwards. This
*seemed to* work before the refactoring but it really did not: since
`VersionEdit` used to have a user-declared destructor, no move
constructor/move assignment operator was generated, and the `move` in
`AddEdit` was really a copy. The patch makes the copy explicit. Note: it
should be possible to rework this logic so that we can get away
with the move but for now, this should fix the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400
Test Plan:
`make check`
`make analyze`
Differential Revision: D19824466
Pulled By: ltamasi
fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396
Differential Revision: D19808978
Pulled By: cheng-chang
fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387
Test Plan: Add unit tests
Differential Revision: D19799819
Pulled By: anand1976
fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
Summary:
Fixed an error when compiled with -Og:
db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Signed-off-by: Robert Yang <liezhi.yang@windriver.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275
Differential Revision: D19381755
fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c
Summary:
This is a bunch of small improvements to `VersionEdit`. Namely, the patch
* Makes the names and order of variables, methods, and code chunks related
to the various information elements more consistent, and adds missing
getters for the sake of completeness.
* Initializes previously uninitialized stack variables.
* Marks all getters const to improve const correctness.
* Adds in-class initializers and removes the default ctor that would
create an object with uninitialized built-in fields and call `Clear`
afterwards.
* Adds a new type alias for new files and changes the existing `typedef`
for deleted files into a type alias as well.
* Makes the helper method `DecodeNewFile4From` private.
* Switches from long-winded iterator syntax to range based loops in a
couple of places.
* Fixes a couple of assignments where an integer 0 was assigned to
boolean members.
* Fixes a getter which used to return a `const std::string` instead of
the intended `const std::string&`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383
Test Plan: make check
Differential Revision: D19780537
Pulled By: ltamasi
fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0
Summary:
Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379
Test Plan: not needed
Differential Revision: D19795440
Pulled By: cheng-chang
fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b
Summary:
Several statuses were not checked during DB::Open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380
Test Plan: make check
Differential Revision: D19780237
Pulled By: cheng-chang
fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9
Summary:
Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case.
If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already.
Test plan (dev server):
```
./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults
make check
```
Both must succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385
Differential Revision: D19782943
Pulled By: riversand963
fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121
Summary:
In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files.
The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved.
This PR updates the test to make sure Get(Key(k)) returns the original value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370
Test Plan: make db_sst_test && ./db_sst_test
Differential Revision: D19726278
Pulled By: cheng-chang
fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7
Summary:
Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363
Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s.
Differential Revision: D19702509
Pulled By: al13n321
fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73
Summary:
Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369
Test Plan: Run all existing tests
Differential Revision: D19714872
fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
Fix an intermittent failure in
DBErrorHandlingTest.CompactionManifestWriteError due to a race between
background error recovery and the main test thread calling
TEST_WaitForCompact().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6367
Test Plan: Run the test using gtest_parallel
Differential Revision: D19713802
Pulled By: anand1976
fbshipit-source-id: 29e35dc26e0984fe8334c083e059f4fa1f335d68
Summary:
Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365
Test Plan: RUn all existing tests.
Differential Revision: D19709399
fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe
Summary:
A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358
Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families.
Differential Revision: D19675141
fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be
Summary:
MultiDBCompactionError fails when it verifies the number of files on level 0 and level 1 without waiting for compaction to finish.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6266
Differential Revision: D19701639
Pulled By: riversand963
fbshipit-source-id: e96d511bcde705075f073e0b550cebcd2ecfccdc
Summary:
DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356
Test Plan: Run the test with both of LITE and non-LITE build.
Differential Revision: D19669537
fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e
Summary:
Add a unit test for prefix extractor change, including a check that fails due to a bug.
Also comment out the partitioned filter case which will fail the test too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323
Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented)
Differential Revision: D19509744
fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351
Differential Revision: D19648931
Pulled By: maysamyabandeh
fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
Summary:
Unit test names, together with other components, are used to create log files
during some internal testing. Overly long names cause infra failure due to file
names being too long.
Look for internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352
Differential Revision: D19649307
Pulled By: riversand963
fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316
When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flush and compactions, as well as foreground operatons such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.
To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331
Test Plan: Add new unit tests in error_handler_test.cc
Differential Revision: D19632951
Pulled By: anand1976
fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
Summary:
DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there.
Fix it by disabling the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346
Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode
Differential Revision: D19627486
fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08
Summary:
db_bloom_filter_test break with clang LITE build with following message:
db/db_bloom_filter_test.cc:23:29: error: unused variable 'kPlainTable' [-Werror,-Wunused-const-variable]
static constexpr PseudoMode kPlainTable = -1;
^
Fix it by moving the declaration out of LITE build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6340
Test Plan:
USE_CLANG=1 LITE=1 make db_bloom_filter_test
and without LITE=1
Differential Revision: D19609834
fbshipit-source-id: 0e88f5c6759238a94f9880d84c785ac18e7cdd7e
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313
Differential Revision: D19458291
Pulled By: maysamyabandeh
fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
Differential Revision: D19458717
fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
Summary:
./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times.
```
With the fix: all runs (tried 100, 1000, 10000) succeed.
```
$ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000
[1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms)
```
Test Plan:
Build:
```
COMPILE_WITH_TSAN=1 make db_compaction_test -j100
```
Without the fix: a few runs out of 100 fail:
```
$ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100
...
...
Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN ] DBCompactionTest.LevelTtlCascadingCompactions
db/db_compaction_test.cc:3687: Failure
Expected equality of these values:
oldest_time
Which is: 1580155869
level_to_files[6][0].oldest_ancester_time
Which is: 1580155870
DB is still at /dev/shm//db_compaction_test_6337001442947696266
[ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms)
[----------] 1 test from DBCompactionTest (1432 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1433 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions
1 FAILED TEST
[80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms)
[100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms)
FAILED TESTS (4/100):
1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/90)
1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/84)
1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/82)
1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/74)
Differential Revision: D19587040
Pulled By: sagar0
fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950
Summary:
It chooses the oldest memtable, not the largest one. This is an
important difference for users whose CFs receive non-uniform write
rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335
Differential Revision: D19588865
Pulled By: maysamyabandeh
fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked
Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328
Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.
Differential Revision: D19586751
fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
Summary:
The earlier code used two conflicting definitions for the number of
input records going into a compaction, one based on the
`rocksdb.num.entries` table property and one based on
`CompactionIterationStats`. The first one is correct and in line
with how output records are counted, while the second one incorrectly
ignores input records in various cases when the `CompactionIterator`
advances or reseeks the input iterator (this can happen, amongst other
cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s,
and compaction filters). This can result in the code undercounting the
input records and computing an incorrect value for "records dropped"
during the compaction. The patch fixes this by switching over to the
correct (table property based) input record count for "records dropped".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325
Test Plan: Tested using `make check` and `db_bench`.
Differential Revision: D19525491
Pulled By: ltamasi
fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636
Summary:
When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322
Test Plan: Add a unit test in db_write_test
Differential Revision: D19538313
Pulled By: anand1976
fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a
Summary:
This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else.
The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue.
But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321
Differential Revision: D19506585
Pulled By: ltamasi
fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a
Summary:
Currently, this test case tries to infer whether
`VersionStorageInfo::UpdateAccumulatedStats` was called during open by
checking the number of files opened against an arbitrary threshold (10).
This makes the test brittle and results in sporadic failures. The patch
changes the test case to use sync points to directly test whether
`UpdateAccumulatedStats` was called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306
Test Plan: `make check`
Differential Revision: D19439544
Pulled By: ltamasi
fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f
Summary:
When we do concurrently writes, and different write operations will have WAL enable or disable.
But the data from write operation with WAL disabled will still be logged into log files, which will lead to extra disk write/sync since we do not want any guarantee for these part of data.
Detail can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoid mixing the two types in a write group. The advantage is simpler reasoning about the write group content
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290
Differential Revision: D19448598
Pulled By: maysamyabandeh
fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de
Summary:
This PR adds a `rocksdb_options_set_atomic_flush` function to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307
Differential Revision: D19451313
Pulled By: ltamasi
fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c
Summary:
Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305
Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0
Differential Revision: D19438925
fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
Summary:
https://github.com/facebook/rocksdb/pull/2205 introduced a new
configuration option called `max_background_jobs`, superseding the
earlier options `max_background_flushes` and
`max_background_compactions`. However, unlike
`max_background_compactions`, setting `max_background_jobs` dynamically
through the `SetDBOptions` interface does not adjust the size of the
thread pools (see https://github.com/facebook/rocksdb/issues/6298). The
patch fixes this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300
Test Plan: Extended unit test.
Differential Revision: D19430899
Pulled By: ltamasi
fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074
Summary:
Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302
Test Plan:
Add a unit test that would fail. Not fix.
crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure.
Differential Revision: D19424572
fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
Differential Revision: D19404695
fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
Summary:
A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295
Test Plan: Run the test in LITE build and see it passes.
Differential Revision: D19395678
fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033
Summary:
The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key.
The result would be that it would hit an assertion in debug mode and lower level files might be skipped.
This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked.
The fix is to fix the fractional cascading index.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285
Test Plan: Add a unit test which would cause the assertion which would be fixed.
Differential Revision: D19358426
fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e