Summary:
Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379
Test Plan: not needed
Differential Revision: D19795440
Pulled By: cheng-chang
fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b
Summary:
Several statuses were not checked during DB::Open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380
Test Plan: make check
Differential Revision: D19780237
Pulled By: cheng-chang
fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9
Summary:
Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case.
If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already.
Test plan (dev server):
```
./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults
make check
```
Both must succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385
Differential Revision: D19782943
Pulled By: riversand963
fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121
Summary:
Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363
Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s.
Differential Revision: D19702509
Pulled By: al13n321
fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365
Test Plan: RUn all existing tests.
Differential Revision: D19709399
fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe
Summary:
A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358
Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families.
Differential Revision: D19675141
fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351
Differential Revision: D19648931
Pulled By: maysamyabandeh
fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313
Differential Revision: D19458291
Pulled By: maysamyabandeh
fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
Summary:
It chooses the oldest memtable, not the largest one. This is an
important difference for users whose CFs receive non-uniform write
rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335
Differential Revision: D19588865
Pulled By: maysamyabandeh
fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec
Summary:
https://github.com/facebook/rocksdb/pull/2205 introduced a new
configuration option called `max_background_jobs`, superseding the
earlier options `max_background_flushes` and
`max_background_compactions`. However, unlike
`max_background_compactions`, setting `max_background_jobs` dynamically
through the `SetDBOptions` interface does not adjust the size of the
thread pools (see https://github.com/facebook/rocksdb/issues/6298). The
patch fixes this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300
Test Plan: Extended unit test.
Differential Revision: D19430899
Pulled By: ltamasi
fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074
Summary:
Fix an error message when CURRENT is not found.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264
Differential Revision: D19300699
Pulled By: riversand963
fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf
Summary:
The bad code was:
```
mutex.Lock(); // `mutex` protects `container`
for (auto& x : container) {
mutex.Unlock();
// do stuff to x
mutex.Lock();
}
```
It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex.
Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`.
(RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ )
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193
Test Plan: Ran some logdevice integration tests that were crashing without this fix.
Differential Revision: D19116874
Pulled By: al13n321
fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
Then I found "delete sv" in ~SuperVersion() takes the time.
The backtrace looks like this
DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
I think it's better to delete in a background thread, please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
Differential Revision: D18972066
fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. As it turns out,
this is caused by the code creating and installing a new `SuperVersion` even if
no memtables were actually trimmed. The patch adds a check to avoid this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169
Test Plan:
Compared `perf` output for
```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```
before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
`rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
no longer registers in the `perf` report.
Differential Revision: D19031509
Pulled By: ltamasi
fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.
This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.
fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
Differential Revision: D18926378
fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
Summary:
**Summary:**
This PR fixes two unordered_write related issues:
- ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
- compact range may cause memtable is flushed before pending unordered write finished
1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes
2. there are some pending writes but memtable is already flushed
3. the memtable related WAL is removed( note that the pending-writes were recorded in that WAL).
4. pending-writes write to newer created memtable
5. there is a restart
6. missing the previous pending-writes because WAL is removed but they aren't included in SST.
**How to solve:**
- Wait pending memtable writes before ingestion job check memtable key range
- Wait pending memtable writes before flush memtable.
**Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness.**
**Test Plan:**
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113
Differential Revision: D18895674
Pulled By: maysamyabandeh
fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
Summary:
`low_pri_write_rate_limiter_` is not being used. Removing. `WriteController` has an internal low_pri rate limiter which is the real rate limiter for low-pri writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6068
Test Plan: make
Differential Revision: D18664120
fbshipit-source-id: dfe3e4de033cf3522b67781b383aad7d0936034c
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134
Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1
Differential Revision: D18876064
fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
Summary:
After secondary instance replays the logs from primary, certain files become
obsolete. The secondary should find these files, evict their table readers from
table cache and close them. If this is not done, the secondary will hold on to
these files and prevent their space from being freed.
Test plan (devserver):
```
$./db_secondary_test --gtest_filter=DBSecondaryTest.SecondaryCloseFiles
$make check
$./db_stress -ops_per_thread=100000 -enable_secondary=true -threads=32 -secondary_catch_up_one_in=10000 -clear_column_family_one_in=1000 -reopen=100
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6114
Differential Revision: D18769998
Pulled By: riversand963
fbshipit-source-id: 5d1f151567247196164e1b79d8402fa2045b9120
Summary:
RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg
thread is scheduled. Before this fix, the counter is decremented only when the
bg thread starts and picks an element from the flush queue. This may cause more
than necessary bg threads to be scheduled. Not a correctness issue, but may
affect flush thread count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104
Test Plan:
```
make check
```
Differential Revision: D18735584
Pulled By: riversand963
fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090
Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.
Differential Revision: D18702268
fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.
Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060
Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.
Differential Revision: D18631623
fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.
- Make FileMetaData::being_compacted an atomic variable. This will make
FileMetaData non-copy-able.
- Separate being_compacted from FileMetaData. This requires re-organizing data
structures that are already used in many places.
Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056
Differential Revision: D18620488
Pulled By: riversand963
fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
Summary:
The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054
Test Plan: Build with GCC-9 and see it passes.
Differential Revision: D18608958
fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
Summary:
When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock.
The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable.
Fixes https://github.com/facebook/rocksdb/issues/5974
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976
Differential Revision: D18535943
Pulled By: maysamyabandeh
fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
Test Plan:
make check
make asan_check
asan_crash_test
Differential Revision: D18408676
Pulled By: anand1976
fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
Summary:
FlushJobInfo and CompactionJobInfo are aggregates; we should use the
aggregate initialization syntax to ensure members (specifically those of
built-in types) are value-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997
Test Plan: make check
Differential Revision: D18273398
Pulled By: ltamasi
fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392
Summary:
In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716
Test Plan: Run all tests with ASAN and TSAN.
Differential Revision: D18217658
fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948
Test Plan: Added unit test.
Differential Revision: D18056151
Pulled By: vjnadimpalli
fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.
Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.
Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856
Differential Revision: D17620818
Pulled By: riversand963
fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
Summary:
This patch adds a number of new information elements to the FlushJobInfo and
CompactionJobInfo structures that are passed to EventListeners via the
OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
Namely, for flushes, the file numbers of the new SST and the oldest blob file it
references are propagated. For compactions, the new pieces of information are
the file number, level, and the oldest blob file referenced by each compaction
input and output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962
Test Plan:
Extended the EventListener unit tests with logic that checks that these information
elements are correctly propagated from the corresponding FileMetaData.
Differential Revision: D18095568
Pulled By: ltamasi
fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
Summary:
When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST.
Fix https://github.com/facebook/rocksdb/issues/5892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908
Test Plan: Add new test. The test will fail without the fix.
Differential Revision: D17916144
Pulled By: riversand963
fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903
Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.
Differential Revision: D17859997
Pulled By: ltamasi
fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
Test Plan:
Existing tests
Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.
TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
Differential Revision: D17578869
Pulled By: vjnadimpalli
fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
Summary:
Atomic flush is incompatible with pipelined write. At least now.
If pipelined write is enabled, a thread performing write can exit the write
thread and start inserting into memtables. Consequently a thread performing
flush will enter write thread and race with memtable insertion by the former.
This will cause undefined result in terms of data persistence.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5860
Test Plan:
```
$make all && make check
```
Differential Revision: D17638944
Pulled By: riversand963
fbshipit-source-id: abc578dc49a5dbe41bc5adcecf448f8e042a6d49
Summary:
This is a bug occaionally shows up in crash test, and this unit test is to reproduce it. The bug is following:
1. Database has multiple CFs.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushes between two CFs.
The DB will fail to be opened again with error "SST file is ahead of WALs"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5851
Test Plan: Run the test itself.
Differential Revision: D17614721
fbshipit-source-id: 1b0abce49b203a76a039e38e76bc940429975f20
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827
Test Plan: Run all existing tests.
Differential Revision: D17483727
fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
Summary:
Originally the loop of closing WAL in PurgeObsoleteFiles resides inside a loop
iterating over the candidate files. It should be moved out.
Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5804
Differential Revision: D17374350
Pulled By: riversand963
fbshipit-source-id: 2bee7343fc0481d9a385a87c7676491522285c96
Summary:
purge_queue_ maybe contains thousands sst files, for example manual compact a range. If full scan is triggered at the same time and the total sst files number is large, RocksDB will be blocked at https://github.com/facebook/rocksdb/blob/master/db/db_impl_files.cc#L150 for several seconds. In our environment we have 140,000 sst files and the manual compaction delete about 1000 sst files, it blocked about 2 minutes.
Commandeering https://github.com/facebook/rocksdb/issues/5290.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5796
Differential Revision: D17357775
Pulled By: riversand963
fbshipit-source-id: 20eacca917355b8de975ccc7b1c9a3e7bd5b201a
Summary:
Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971
Test Plan: add tests at more spots.
Differential Revision: D17369043
fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797
Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.
Differential Revision: D17396833
Pulled By: maysamyabandeh
fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
Test Plan: Build whole project using make and cmake.
Differential Revision: D17374550
fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
Summary:
Currently IngestExternalFile() fails when its input files' ranges overlap. This condition doesn't need to hold for files that are to be ingested in L0, though.
This commit allows overlapping files and forces their target level to L0.
Additionally, ingest job's completion is logged to EventLogger, analogous to flush and compaction jobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5539
Differential Revision: D17370660
Pulled By: riversand963
fbshipit-source-id: 749a3899b17d1be267a5afd5b0a99d96b38ab2f3
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801
Test Plan: make check
Differential Revision: D17371012
Pulled By: anand1976
fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
When building with clang 9, warning is reported for InternalDBStatsType type names shadowed the one for statistics. Rename them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5779
Test Plan: Build with clang 9 and see it passes.
Differential Revision: D17239378
fbshipit-source-id: af28fb42066c738cd1b841f9fe21ab4671dafd18
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.
Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765
Differential Revision: D17172717
Pulled By: affandar
fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
Test Plan: Added unit tests.
Differential Revision: D16963840
Pulled By: vjnadimpalli
fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
Summary:
Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
Also skip deleting non-existing hard links during cleanup-after-failure.
Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make all
$./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
$makke check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760
Differential Revision: D17142982
Pulled By: riversand963
fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278
The patch fixes that and also adds an assert to ensure that the feature is not used when the option is zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724
Differential Revision: D16918185
Pulled By: maysamyabandeh
fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
Test Plan: Add a new unit test.
Differential Revision: D16860874
fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
Test Plan: Add a new unit test
Differential Revision: D16550326
fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
Summary:
Currently in `DBImpl::PurgeObsoleteFiles`, the list of candidate files is create through a combination of calling LogFileName using `log_delete_files` and `full_scan_candidate_files`.
In full_scan_candidate_files, the filenames look like this
{file_name = "074715.log", file_path = "/txlogs/3306"},
but LogFileName produces filenames like this that prepends a slash:
{file_name = "/074715.log", file_path = "/txlogs/3306"},
This confuses the dedup step here: bb4178066d/db/db_impl/db_impl_files.cc (L339-L345)
Because duplicates still exist, DeleteFile is called on the same file twice, and hits an error on the second try. Error message: Failed to mark /txlogs/3302/764418.log as trash.
The root cause is the use of `kDumbDbName` when generating file names, it creates file names like /074715.log. This PR removes the use of `kDumbDbName` and create paths without leading '/' when dbname can be ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5603
Test Plan: make check
Differential Revision: D16413203
Pulled By: miasantreble
fbshipit-source-id: 6ba8288382c55f7d5e3892d722fc94b57d2e4491
Summary:
In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
Differential Revision: D16433481
Pulled By: elipoz
fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
Summary:
The `TransactionTest.MultiGetBatchedTest` were failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared.
This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634
Differential Revision: D16552674
Pulled By: lth
fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804
Summary:
The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626
Differential Revision: D16496913
Pulled By: elipoz
fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
Summary:
In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
We address these issues in this PR by doing the following.
1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
If the API extension looks good, I will add more unit tests.
Some simple benchmark using db_bench.
```
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
```
Master is at a78503bd6c.
```
| | readrandom | fillrandom |
| master | 15.53 MB/s | 25.97 MB/s |
| PR5502 | 16.70 MB/s | 25.80 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
Differential Revision: D16340894
Pulled By: riversand963
fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
Summary:
There are a number of fixes in this PR (with most bugs found via the added stress tests):
1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
4. Add stress test to test reading own writes.
5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
7. Add some more comments everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
Differential Revision: D16276089
Pulled By: lth
fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
Summary:
Added log_readahead_size option to control prefetching for Log::Reader.
This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592
Differential Revision: D16362989
Pulled By: elipoz
fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
// is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
const std::string& export_dir,
ExportImportFilesMetaData** metadata);
Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.
(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const ExportImportFilesMetaData& metadata,
ColumnFamilyHandle** handle);
Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.
Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
Differential Revision: D16018881
fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
Summary:
1. Cleanup WAL trash files on open
2. Don't apply deletion rate limit if WAL dir is different from db dir
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520
Test Plan: Add new unit tests and make check
Differential Revision: D16096750
Pulled By: anand1976
fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b
Summary:
This is to prevent bg flush thread from unrefing and deleting the cfd that has been dropped by a concurrent thread.
Before RocksDB calls `DBImpl::WaitForFlushMemTables`, we should increase the refcount of each `ColumnFamilyData` so that its ref count will not drop to 0 even if the column family is dropped by another thread. Otherwise the bg flush thread can deref the cfd and deletes it, causing a segfault in `WaitForFlushMemtables` upon accessing `cfd`.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5513
Differential Revision: D16062898
Pulled By: riversand963
fbshipit-source-id: 37dc511f1dc99f036d0201bbd7f0a8f5677c763d
Summary:
WAL records RocksDB writes to all column families. When user flushes a a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation)
Because of this, if there is a column family that receive very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced persistent stats column family which is a good example of such column families. Depending on the config, it may have long intervals between writes, and user is unaware of it which makes it difficult to call manual flush for it.
This PR addresses the problem for persistent stats column family by forcing a flush for persistent stats column family when 1) another column family is flushed 2) persistent stats column family's log number is the smallest among all column families, this way persistent stats column family will keep advancing its log number when necessary, allowing RocksDB to delete old WAL files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509
Differential Revision: D16045896
Pulled By: miasantreble
fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
Summary:
It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest.
Also syncing after writing global sequence when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back.
Fixes https://github.com/facebook/rocksdb/issues/5287.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435
Test Plan:
Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false.
https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9
More test suggestions are welcome.
Differential Revision: D15941675
Pulled By: riversand963
fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15819451
Pulled By: HaoyuHuang
fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.
Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472
Differential Revision: D15866771
Pulled By: riversand963
fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
Summary:
`DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.
Test plan (on dev server):
```
$make clean && make -j20 all
$./db_secondary_test
```
All other existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469
Differential Revision: D15861845
Pulled By: riversand963
fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
Differential Revision: D15863138
Pulled By: miasantreble
fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
Summary:
It seems like CF Options are not properly validated when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`).
**Test Plan:**
Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
```
TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
```
Also ran gtest-parallel to make sure the new test is not flaky.
```
TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
[10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453
Differential Revision: D15816851
Pulled By: sagar0
fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433
Differential Revision: D15728016
Pulled By: HaoyuHuang
fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
Summary:
In secondary mode, it is possible that the secondary lists the primary's WAL
directory, finds a WAL and tries to open it. It is possible that the primary
deletes the WAL after secondary listing dir but before the secondary opening
it. Then the secondary will fail to open the WAL file with a PathNotFound
status. In this case, we can return OK without replaying WAL and optionally
replay more MANIFEST.
Test Plan (on my dev machine):
Without this PR, the following will fail several times out of 100 runs.
```
~/gtest-parallel/gtest-parallel -r 100 -w 16 ./db_secondary_test --gtest_filter=DBSecondaryTest.SwitchToNewManifestDuringOpen
```
With this PR, the above should always succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5323
Differential Revision: D15763878
Pulled By: riversand963
fbshipit-source-id: c7164fa7cb8d9001abc258b6a2dc93613e4f38ff
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941
Differential Revision: D13932666
fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
Summary:
To avoid deadlock mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck when it already holds lock on log_write_mutex_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437
Differential Revision: D15749722
Pulled By: maysamyabandeh
fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a
Summary:
Use `CreateLoggerFromOptions` function to reduce code duplication.
Test plan (on my machine)
```
$make clean && make -j32 db_secondary_test
$KEEP_DB=1 ./db_secondary_test
```
Verify all info logs of the secondary instance are properly logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5427
Differential Revision: D15748922
Pulled By: riversand963
fbshipit-source-id: bad7261df1b8373efc504f141efc7871e375a311
Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).
I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.
Experiment setup:
RocksDB: version 6.2
Date: Mon Jun 10 10:42:51 2019
CPU: 24 * Intel Core Processor (Skylake)
CPUCache: 16384 KB
Keys: 20 bytes each
Values: 100 bytes each (100 bytes after compression)
Entries: 1000000
Prefix: 20 bytes
Keys per prefix: 0
RawSize: 114.4 MB (estimated)
FileSize: 114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421
Differential Revision: D15704258
Pulled By: HaoyuHuang
fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
Summary:
In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay.
When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables.
1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST.
3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413
Differential Revision: D15637407
Pulled By: riversand963
fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420
Differential Revision: D15741985
Pulled By: maysamyabandeh
fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
Summary:
I'm not able to prove it, but the stress test failure may be caused by the following sequence of events -
1. Crash db_stress while writing the log file. This should result in a corrupted WAL.
2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery and the CF log number will point to the newer log
3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error
Test -
1. stress_crash
2. make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5412
Differential Revision: D15699120
Pulled By: anand1976
fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.
We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.
Results are as follows:
```
| | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
Differential Revision: D15132946
Pulled By: riversand963
fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
Summary:
With this commit, RocksDB secondary instance respects atomic groups in version edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411
Differential Revision: D15617512
Pulled By: HaoyuHuang
fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120
Summary:
The PR #5275 separated the column dropped and shutdown status codes. However, there were a couple of places in compaction where this change ended up treating a ShutdownInProgress() error as a real error and set bg_error. This caused MyRocks unit test to fail due to WAL writes during shutdown returning this error. Fix it by ignoring the shutdown status during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5400
Differential Revision: D15611680
Pulled By: anand1976
fbshipit-source-id: c602e97840e3ae24eb420d61e0ce95d3e6258632
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368
Differential Revision: D15540101
Pulled By: maysamyabandeh
fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376
Differential Revision: D15550938
Pulled By: siying
fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f