Summary:
* Add missing unit test for schema stability of FileChecksumGenCrc32c
(previously was only comparing to itself)
* A lot of clarifying comments
* Add some assertions for preconditions
* Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
* Simplify FileChecksumGenCrc32c with shared functions
* Implement EndianSwapValue to replace unused EndianTransform
And incidentally since I had trouble with 'make check-format' GitHub action disagreeing with local run,
* Output full diagnostic information when 'make check-format' fails in CI
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861
Test Plan: new unit test passes before & after other changes
Reviewed By: zhichao-cao
Differential Revision: D21667115
Pulled By: pdillinger
fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.
Issue: https://github.com/facebook/rocksdb/issues/4997
Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072
Reviewed By: pdillinger
Differential Revision: D18687364
Pulled By: sebastianopeluso
fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023
Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.
Differential Revision: D18453559
Pulled By: pdillinger
fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.
This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.
Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.
Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015
Test Plan: Updated unit tests
Differential Revision: D18401333
Pulled By: pdillinger
fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
Test Plan: Build whole project using make and cmake.
Differential Revision: D17374550
fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113
Differential Revision: D14633030
Pulled By: siying
fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
Summary:
Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640
Differential Revision: D13015268
Pulled By: ajkr
fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
Summary:
HashMayMatch is related to AddKey() instead of CreateFilter().
Also applies some minor Fixes#4191#4200#3910
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202
Differential Revision: D9180945
Pulled By: maysamyabandeh
fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
Summary:
Fixes#3391.
This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778
Differential Revision: D8821836
Pulled By: anand1976
fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
Differential Revision: D8846653
Pulled By: maysamyabandeh
fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
Summary:
Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below.
```
Time thread 1 thread 2
| CreateNewBackup -> GetLiveFiles
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file
| copy the oldest OPTIONS // IO error!
V
```
Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108
Differential Revision: D8796360
Pulled By: riversand963
fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026
Differential Revision: D8555214
Pulled By: riversand963
fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
Summary:
We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems.
Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied. So we can copy private files directly to their final location, so now there's no directory rename.
Closes https://github.com/facebook/rocksdb/pull/3749
Differential Revision: D7705610
Pulled By: ajkr
fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718
Reviewed By: maysamyabandeh
Differential Revision: D7621192
Pulled By: miasantreble
fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`.
Closes https://github.com/facebook/rocksdb/pull/3518
Differential Revision: D7008331
Pulled By: ajkr
fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305
Summary:
The 10MB buffer in BackupEngineImpl::BackupMeta::StoreToFile can be corrupted with a large number of files. Added a check to determine current buffer length and append data to file if buffer becomes full.
Resolves https://github.com/facebook/rocksdb/issues/3228
Closes https://github.com/facebook/rocksdb/pull/3636
Differential Revision: D7354160
Pulled By: ajkr
fbshipit-source-id: eec12d38095a0d17551a4aaee52b99d30a555722
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
Use the rsync tempfile naming convention in our `BackupEngine`. The temp file follows the format, `.<filename>.<suffix>`, which is later renamed to `<filename>`. We fix `tmp` as the `<suffix>` as we don't need to use random bytes for now. The benefit is gluster treats this tempfile naming convention specially and applies hashing only to `<filename>`, so the file won't need to be linked or moved when it's renamed. Our gluster team suggested this will make things operationally easier.
Closes https://github.com/facebook/rocksdb/pull/3463
Differential Revision: D6893333
Pulled By: ajkr
fbshipit-source-id: fd7622978f4b2487fce33cde40dd3124f16bcaa8
Summary:
- check most times after calling snprintf that the buffer didn't fill up. Previously we'd proceed and use `buf_size - len` as the length in subsequent calls, which underflowed as those are unsigned size_t.
- replace some memcpys with snprintf for consistency
Closes https://github.com/facebook/rocksdb/pull/3255
Differential Revision: D6541464
Pulled By: ajkr
fbshipit-source-id: 8610ea6a24f38e0a37c6d17bc65b7c712da6d932
Summary:
1. Class BackupMeta
```
52 : timestamp_(0), size_(0), meta_filename_(meta_filename),
CID 1168103 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member sequence_number_ is not initialized in this constructor nor in any functions that it calls.
153 file_infos_(file_infos), env_(env) {}
```
2. class BackupEngineImpl
```
513 }
7. uninit_member: Non-static class member latest_backup_id_ is not initialized in this constructor nor in any functions that it calls.
CID 1322803 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
9. uninit_member: Non-static class member latest_valid_backup_id_ is not initialized in this constructor nor in any functions that it calls.
514}
```
3. struct BackupAfterCopyOrCreateWorkItem
```
368 struct BackupAfterCopyOrCreateWorkItem {
369 std::future<CopyOrCreateResult> result;
1. member_decl: Class member declaration for shared.
370 bool shared;
3. member_decl: Class member declaration for needed_to_copy.
371 bool needed_to_copy;
5. member_decl: Class member declaration for backup_env.
372 Env* backup_env;
373 std::string dst_path_tmp;
374 std::string dst_path;
375 std::string dst_relative;
2. uninit_member: Non-static class member shared is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member needed_to_copy is not initialized in this constructor nor in any functions that it calls.
CID 1396122 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member backup_env is not initialized in this constructor nor in any functions that it calls.
376 BackupAfterCopyOrCreateWorkItem() {}
```
4. struct CopyOrCreateWorkItem
```
318 struct CopyOrCreateWorkItem {
319 std::string src_path;
320 std::string dst_path;
321 std::string contents;
1. member_decl: Class member declaration for src_env.
322 Env* src_env;
3. member_decl: Class member declaration for dst_env.
323 Env* dst_env;
5. member_decl: Class member declaration for sync.
324 bool sync;
7. member_decl: Class member declaration for rate_limiter.
325 RateLimiter* rate_limiter;
9. member_decl: Class member declaration for size_limit.
326 uint64_t size_limit;
327 std::promise<CopyOrCreateResult> result;
328 std::function<void()> progress_callback;
329
2. uninit_member: Non-static class member src_env is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member dst_env is not initialized in this constructor nor in any functions that it calls.
6. uninit_member: Non-static class member sync is not initialized in this constructor nor in any functions that it calls.
8. uninit_member: Non-static class member rate_limiter is not initialized in this constructor nor in any functions that it calls.
CID 1396123 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
10. uninit_member: Non-static class member size_limit is not initialized in this constructor nor in any functions that it calls.
330 CopyOrCreateWorkItem() {}
```
5. struct RestoreAfterCopyOrCreateWorkItem
```
struct RestoreAfterCopyOrCreateWorkItem {
410 std::future<CopyOrCreateResult> result;
1. member_decl: Class member declaration for checksum_value.
411 uint32_t checksum_value;
CID 1396153 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member checksum_value is not initialized in this constructor nor in any functions that it calls.
412 RestoreAfterCopyOrCreateWorkItem() {}
```
Closes https://github.com/facebook/rocksdb/pull/3131
Differential Revision: D6428556
Pulled By: sagar0
fbshipit-source-id: a86675444543eff028e3cae6942197a143a112c4
Summary:
There are internal users who open BackupEngine for writing new backups only, and they don't care whether old backups can be read or not. The condition `BackupableDBOptions::max_valid_backups_to_open == 0` should be supported (previously in df74b775e6 I made the mistake of choosing 0 as a special value to disable the limit).
Closes https://github.com/facebook/rocksdb/pull/2819
Differential Revision: D5751599
Pulled By: ajkr
fbshipit-source-id: e73ac19eb5d756d6b68601eae8e43407ee4f2752
Summary:
Backup engine is intentionally openable even when some backups are corrupt. Previously the engine could write new backups as long as the most recent backup wasn't corrupt. This PR makes the backup engine able to create new backups even when the most recent one is corrupt.
We now maintain two ID instance variables:
- `latest_backup_id_` is used when creating backup to choose the new ID
- `latest_valid_backup_id_` is used when restoring latest backup since we want most recent valid one
Closes https://github.com/facebook/rocksdb/pull/2804
Differential Revision: D5734148
Pulled By: ajkr
fbshipit-source-id: db440707b31df2c7b084188aa5f6368449e10bcf
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433
Differential Revision: D5216946
Pulled By: ajkr
fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.
Related changes:
- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932
Differential Revision: D4622986
Pulled By: ajkr
fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87
Summary:
This was requested by a customer who wants to proactively monitor whether any valid backups are available. The existing performance was poor because Open() serially reads every small meta-file (one per backup), which was slow on HDFS.
Now we only read the minimum number of meta-files to find `max_valid_backups_to_open` valid backups. The customer mentioned above can just set it to one.
Closes https://github.com/facebook/rocksdb/pull/2151
Differential Revision: D4882564
Pulled By: ajkr
fbshipit-source-id: cb0edf9e8ac693e4d5f24902e725a011ed8c0c2f
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
previously we only cleaned up .tmp files under "shared/" and "private/" directories in case the previous backup failed. we need to do the same for "shared_checksum/"; otherwise, the subsequent backup will fail if it tries to backup at least one of the same files.
Closes https://github.com/facebook/rocksdb/pull/2062
Differential Revision: D4805599
Pulled By: ajkr
fbshipit-source-id: eaa6088
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.
The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946
Differential Revision: D4646489
Pulled By: ajkr
fbshipit-source-id: d8e0161
Summary:
InsertPathnameToSizeBytes() is called on shared/ and shared_checksum/ directories, which only exist for certain configurations. If we try to list a non-existent directory's contents, some Envs will dump an error message. Let's avoid this by checking whether the directory exists before listing its contents.
Closes https://github.com/facebook/rocksdb/pull/1895
Differential Revision: D4596301
Pulled By: ajkr
fbshipit-source-id: c809679
Summary:
As the last step in backup creation, the .tmp directory is renamed omitting the .tmp suffix. In case the process terminates before this, the .tmp directory will be left behind. Even if this happens, we want future backups to succeed, so I added some checks/cleanup for this case.
Closes https://github.com/facebook/rocksdb/pull/1896
Differential Revision: D4597323
Pulled By: ajkr
fbshipit-source-id: 48900d8
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
#1733 started using SizeFileBytes(), so our dummy log file implementation should stop asserting that this function isn't called.
Closes https://github.com/facebook/rocksdb/pull/1740
Differential Revision: D4376055
Pulled By: ajkr
fbshipit-source-id: 2854d89
Summary:
Since the backup work as snapshot, we should only copy
the bytes of the wal while we get the alive files.
Closes https://github.com/facebook/rocksdb/pull/1733
Differential Revision: D4373457
Pulled By: ajkr
fbshipit-source-id: 389318f
Summary:
Fixes compile error:
In file included from ./util/statistics.h:17:0,
from ./util/stop_watch.h:8,
from ./util/perf_step_timer.h:9,
from ./util/iostats_context_imp.h:8,
from ./util/posix_logger.h:27,
from ./port/util_logger.h:18,
from ./db/auto_roll_logger.h:15,
from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656
Differential Revision: D4318702
Pulled By: yiwu-arbug
fbshipit-source-id: 8c5d17a
Summary:
We used to treat any failure to read a backup's meta-file as if the backup were corrupted; however, we should distinguish corruption errors from errors in the backup Env. This fixes an issue where callers would get inconsistent results from GetBackupInfo() if they called it on an engine that encountered Env error during initialization. Now we fail Initialize() in this case so callers cannot invoke GetBackupInfo() on such engines.
Closes https://github.com/facebook/rocksdb/pull/1654
Differential Revision: D4318573
Pulled By: ajkr
fbshipit-source-id: f7a7c54
Summary:
Some users are assuming NotFound means the backup does not
exist at the provided path, which is a reasonable assumption. We need to
stop returning NotFound for system errors.
Depends on #1644
Closes https://github.com/facebook/rocksdb/pull/1645
Differential Revision: D4312233
Pulled By: ajkr
fbshipit-source-id: 5343c10
Summary: LockFile is unnecessary in unit test
Test Plan: env_basic_test.cc
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60285
Summary: Backup options file to private directory
Test Plan:
backupable_db_test.cc, BackupOptions
Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D59373
Summary:
Rocksdb backup and restore rate limiting is currently done per backup/restore.
So, it is difficult to control rate across multiple backup/restores. With this
change, a throttler can be provided. If a throttler is provided, it is used.
Otherwise, a new throttler is created based on the actual rate limits specified
in the options.
Test Plan: Added unit tests
Reviewers: ldemailly, andrewkr, sdong
Reviewed By: andrewkr
Subscribers: igor, yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56265
Summary:
Google C++ Style writes: In particular, prefer to write lambda captures explicitly when capturing this or if the lambda will escape the current scope.
Here it is the case for both.
Test Plan: Run all test suites.
Reviewers: andrewkr, dhruba
Reviewed By: andrewkr, dhruba
Subscribers: yhchiang, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58515
Summary: Disable backupable_db_test.cc on Windows since EnvChroot is not supported
Test Plan: check ROCKSDB_LITE
Reviewers: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58035
Summary:
- Used ChrootEnv so the database and backup Envs are isolated in the filesystem.
- Removed DifferentEnvs test since now every test uses different Envs
Depends on D57543
Test Plan:
- ran backupable_db_test
- verified backupable_db_test now catches the bug when D57159 is backed out (this bug previously passed through the test cases, which motivated this change)
Reviewers: sdong, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57615
Summary: Fix BackupableDBTest.NoDoubleCopy and BackupableDBTest.DifferentEnvs by mocking the db files in db_env instead of backup_env_
Test Plan: make check -j64
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57273
Summary:
When db_env_ != backup_env_, InsertPathnameToSizeBytes() would
use the wrong Env during backup creation. This happened because this function
used backup_env_ instead of db_env_ to get WAL/data file sizes.
This diff adds an argument to InsertPathnameToSizeBytes() indicating which Env
to use.
Test Plan: ran @anirbanb's BackupTestTool
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57159
Summary:
This interface is redundant and has been deprecated for a while.
It's also unused internally. Let's delete it.
I moved the comments to the corresponding functions in BackupEngine/
BackupEngineReadOnly. This caused the diff tool to not work cleanly.
Test Plan:
unit tests
$ ./backupable_db_test
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56331
Summary:
Several of backupable_db_test fails if running standalone, because of directory missing. Fix it by:
(1) garbage collector skips shared directory if it doesn't exit
(2) BackupableDBTest.Issue921Test to create the parent directory of the backup directory fist.
Test Plan: Run the tests individually and make sure they pass
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56829
Summary:
- Need to use unsigned long long for 64-bit literals on windows
- Need size_t for backup meta-file length since clang doesn't let us assign size_t to int
Test Plan: backupable_db_test and options_test
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56391
Summary:
Rocksdb backup engine maintains metadata about backups in separate files. But,
there was no way to add extra application specific data to it. Adding support
for that.
In some use cases, applications decide to restore a backup based on some
metadata. This will help those cases to cheaply decide whether to restore or
not.
Test Plan:
Added a unit test. Existing ones are passing
Sample meta file for BinaryMetadata test-
```
1459454043
0
metadata 6162630A64656600676869
2
private/1/MANIFEST-000001 crc32 1184723444
private/1/CURRENT crc32 3505765120
```
Reviewers: sdong, ldemailly, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, ldemailly
Differential Revision: https://reviews.facebook.net/D56007
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge
Test Plan:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
Reviewers: IslamAbdelRahman, yoshinorim, sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55269
Summary:
Now that we get sizes efficiently, we no longer need the workaround to
embed file size in filename.
Test Plan:
$ ./backupable_db_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55035
Summary:
For VerifyBackup(), backup files can be spread across "shared/",
"shared_checksum/", and "private/" subdirectories, so we have to
bulk get all three.
For CreateNewBackup(), we make two separate bulk calls: one for the
data files and one for WAL files.
There is also a new helper function, ExtendPathnameToSizeBytes(),
that translates the file attributes vector to a map. I decided to leave
GetChildrenFileAttributes()'s (from D53781) return type as vector to
keep it consistent with GetChildren().
Depends on D53781.
Test Plan:
verified relevant unit tests
$ ./backupable_db_test
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53919
Summary:
Fixed two related race conditions in backup creation.
(1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files
from being deleted while it is copying; however, the MANIFEST file could still
rotate during this time. The fix is to stop deleting the old manifest in the
rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs
(can only happen when file deletions are enabled).
(2) CreateNewBackup() did not account for the CURRENT file being mutable.
This is significant because the files returned by GetLiveFiles() contain a
particular manifest filename, but the manifest to which CURRENT refers can
change at any time. This causes problems when CURRENT changes between the call
to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I
manually forge a CURRENT file referring to the manifest filename returned in
GetLiveFiles().
(2) also applies to the checkpointing code, so let me know if this approach is
good and I'll make the same change there.
Test Plan:
new test for roll manifest during backup creation.
running the test before this change:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
...
IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory
running the test after this change:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
...
[ RUN ] BackupableDBTest.ChangeManifestDuringBackupCreation
[ OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms)
Reviewers: IslamAbdelRahman, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54711
Summary:
As titled. This fixes the tsan error caused by logger_ being used in
backup_engine_'s destructor. It does not fix the transient unit test failure,
which is caused by MANIFEST file changing while backup is happening.
Test Plan:
verified the tsan error no longer happens on either success or
failure.
$ COMPILE_WITH_TSAN=1 make -j32 backupable_db_test
$ while ./backupable_db_test --gtest_filter=BackupableDBTest.CorruptionsTest ; do : ; done
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54669
Summary: fix memory leak in test code
Test Plan: ran test
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52617
Summary:
See a bug report here: https://github.com/facebook/rocksdb/issues/921
The fix is to not check the shared/ directory if share_table_files is false. We could also check FileExists() before GetChildren(), but that will add extra latency when Env is Hdfs :(
Test Plan: added a unit test
Reviewers: rven, sdong, IslamAbdelRahman, yhchiang, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52593
Summary: Getting file size from all the backup files can take a long time. In some cases, the sizes are available in file names. We allow a mode to get those sizes from file name.
Test Plan:
Make some unit tests in backupable_db_test to run in such a mode.
Make sure RocksDB Lite builds too.
Reviewers: IslamAbdelRahman, rven, yhchiang, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: muthu, asameet, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51243
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.
In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.
// If true, then DB::Open / CreateColumnFamily / DropColumnFamily
// / SetOptions will fail if options file is not detected or properly
// persisted.
//
// DEFAULT: false
bool fail_if_missing_options_file;
Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.
Test Plan:
Add options_file_test.
options_test
column_family_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48285