Commit Graph

69 Commits

Author SHA1 Message Date
Akanksha Mahajan
8465cccde2 Posix API support for Async Read and Poll APIs (#9578)
Summary:
Provide support for Async Read and Poll in Posix file system using IOUring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578

Test Plan: In progress

Reviewed By: anand1976

Differential Revision: D34690256

Pulled By: akankshamahajan15

fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
2022-03-10 18:28:31 -08:00
Peter Dillinger
e7ac7363b4 Add to HISTORY and minor loose ends from #9294, #9254 (#9386)
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D33590715

Pulled By: pdillinger

fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
2022-01-21 13:04:19 -08:00
Yanqin Jin
1a8e9f0e07 Use fcntl(F_FULLFSYNC) on OS X (#9356)
Summary:
Closing https://github.com/facebook/rocksdb/issues/5954

fsync/fdatasync on Linux:
```
(fsync/fdatasync) includes writing through or flushing a disk cache if present.
```

However, on OS X and iOS:
```
(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.
```

Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same
persistence guarantee.

According to OSX man page,
```
The F_FULLFSYNC fcntl asks the drive to flush **all** buffered data to permanent storage.
```
This suggests that it will be no faster than `fsync` on Linux, since Linux, according to its man page,
```
writing through or flushing a disk cache if present
```
It means Linux may not flush **all** data from disk cache.

This is similar to bug reports/fixes in:
- golang: https://github.com/golang/go/issues/26650
- leveldb: 296de8d5b8.

Not sure if we should fallback to fsync since we break persistence contract.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9356

Reviewed By: jay-zhuang

Differential Revision: D33417416

Pulled By: riversand963

fbshipit-source-id: 475548ff9c5eaccde325e0f6842694271cbc8cb7
2022-01-18 20:23:11 -08:00
Adam Retter
c879910102 Fix fstatfs call for compilation on 32 bit systems (#9251)
Summary:
On some 32-bit systems, BTRFS_SUPER_MAGIC is unsigned while __fsword_t is signed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9251

Reviewed By: ajkr

Differential Revision: D32961651

Pulled By: pdillinger

fbshipit-source-id: 78e85fc1336f304a21e4d5961e60957c90daed63
2021-12-08 22:01:23 -08:00
Jay Zhuang
29102641dd Skip directory fsync for filesystem btrfs (#8903)
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
   to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
   `IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
   forced, the same as above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903

Test Plan: run tests on btrfs and non btrfs

Reviewed By: ajkr

Differential Revision: D30885059

Pulled By: jay-zhuang

fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
2021-11-03 12:21:27 -07:00
anand76
7743f033b1 More robust checking of IO uring completion data (#8894)
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894

Reviewed By: siying, pdillinger

Differential Revision: D30826982

Pulled By: anand1976

fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15
2021-09-15 12:44:43 -07:00
sdong
60e5af83c1 Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe() (#8311)
Summary:
Right now return codes by io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled. It is not the good practice. Although these two functions are not supposed to return non-0 values in normal exeuction, people suspect that they might return non-0 value when an interruption happens, and the code might cause hanging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311

Test Plan: Make sure at least normal test cases still pass.

Reviewed By: anand1976

Differential Revision: D28500828

fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211
2021-05-18 16:09:14 -07:00
sdong
e19908cba6 Refactor kill point (#8241)
Summary:
Refactor kill point to one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to a pointer to fault ingestion fs to the killing class, but it ended up with harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good so I send it out.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241

Test Plan: make release and run crash test for a while.

Reviewed By: anand1976

Differential Revision: D28078486

fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f
2021-05-05 15:50:29 -07:00
anand76
7d7f14480e Always truncate the latest WAL file on DB Open (#8122)
Summary:
Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed.

Tests:
Add unit tests to db_wal_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122

Reviewed By: riversand963

Differential Revision: D27366132

Pulled By: anand1976

fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e
2021-03-28 10:00:08 -07:00
Jay Zhuang
45c65d6dcf Use thread-safe strerror_r() to get error message (#8087)
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087

Reviewed By: mrambacher

Differential Revision: D27267151

Pulled By: jay-zhuang

fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
2021-03-24 23:07:27 -07:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
mrambacher
d9d190742c Make env*_test work with ASSERT_STATUS_CHECKED (#7176)
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.

One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176

Reviewed By: cheng-chang

Differential Revision: D22799278

Pulled By: ajkr

fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
2020-07-28 22:59:48 -07:00
Peter Dillinger
88b4210701 Remove racially charged terms "whitelist" and "blacklist" (#7008)
Summary:
We don't need them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7008

Test Plan: "make check" and ensure "make crash_test" starts

Reviewed By: ajkr

Differential Revision: D22143838

Pulled By: pdillinger

fbshipit-source-id: 72c8e16603abc59f4954e304466bc4dc1f58f94e
2020-06-19 15:27:32 -07:00
Akanksha Mahajan
a1523efcdf Status check enforcement for io_posix_test and options_settable_test (#6857)
Summary:
Added status check enforcement for io_posix_test and options_settable_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6857

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 check

Reviewed By: ajkr

Differential Revision: D21647904

Pulled By: akankshamahajan15

fbshipit-source-id: b7f2321eb6c141a88cd5e1270ecb7d58f00341af
2020-05-19 19:22:28 -07:00
Cheng Chang
ada700b906 Re-read the whole request in direct IO mode when IO uring returns partial result (#6853)
Summary:
If both direct IO and IO uring are enabled, when IO uring returns partial result, we'll try to read the remaining part of the request, but the starting address/offset of the remaining part might not be aligned to the block size, in direct IO mode, the unaligned offset causes bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6853

Test Plan: run make check with both direct IO and IO uring enabled, this is covered by one of the continuous tests.

Reviewed By: anand1976

Differential Revision: D21603023

Pulled By: cheng-chang

fbshipit-source-id: 942f6a11ff21e1892af6c4464e02bab4c707787c
2020-05-18 17:25:57 -07:00
Cheng Chang
1758f76f2d Fix unused variable of r in release mode (#6750)
Summary:
In release mode, asserts are not compiled, so `r` is not used, causing compiler warnings.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6750

Test Plan: make check under release mode

Reviewed By: anand1976

Differential Revision: D21220365

Pulled By: cheng-chang

fbshipit-source-id: fd4afa9843d54af68c4da8660ec61549803e1167
2020-04-24 15:14:13 -07:00
Cheng Chang
51bdfae010 Check alignment of MultiRead requests in direct IO mode (#6739)
Summary:
Add assertions to check direct IO's alignment requirements in MultiRead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6739

Test Plan: make check

Reviewed By: siying

Differential Revision: D21143825

Pulled By: cheng-chang

fbshipit-source-id: 26f1623b062a1851080771128feac0669a61f5e9
2020-04-23 15:19:31 -07:00
phantomape
cb671ea1ca env: Add clearerr() before repeating an interrupted file read (#6609)
Summary:
This change updates PosixSequentialFile::Read to call clearerr()
before fread()ing again after an EINTR is returned on a previous
fread.

The original fix is from bd8f1ebb91.
Fixing https://github.com/facebook/rocksdb/issues/6509

Signed-off-by: phantomape <cxucheng@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6609

Reviewed By: zhichao-cao

Differential Revision: D20731482

Pulled By: riversand963

fbshipit-source-id: 7f1f3a1449077d5560f45c465a78d08633740ba0
2020-03-29 21:56:31 -07:00
Cheng Chang
2d9efc9ab2 Cache result of GetLogicalBufferSize in Linux (#6457)
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.

This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.

Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457

Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.

`make check`

Differential Revision: D20131243

Pulled By: cheng-chang

fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
2020-03-11 18:40:05 -07:00
sdong
942eaba091 Handle io_uring partial results (#6441)
Summary:
The logic that handles io_uring partial results was wrong. Fix the logic by putting it into a queue and continue reading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6441

Test Plan: Make sure this patch fixes the application test case where the bug was discovered; in env_test, add a unit test that simulates partial results and make sure the results are still correct.

Differential Revision: D20018616

fbshipit-source-id: 5398a7e34d74c26d52aa69dfd604e93e95d99c62
2020-02-21 16:57:37 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
sdong
d1ae2c3faf Fix an asan warning caused by the recent io_uring change (#6135)
Summary:
ASAN reports:

internal_repo_rocksdb/repo:db_test - MultiThreaded/MultiThreadedDBTest.MultiThreaded/43: fatal
==2692739==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6130000500ca at pc 0x0000006be780 bp 0x7efef85ccd20 sp 0x7efef85cc4d0
[CONTEXT] === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
[CONTEXT] READ of size 331 at 0x6130000500ca thread T195
[CONTEXT]      #0 db_test_bin+0x6be77f                     __interceptor_strlen.part.35
[CONTEXT]      https://github.com/facebook/rocksdb/issues/1 internal_repo_rocksdb/repo/include/rocksdb/slice.h:55 rocksdb::Slice::Slice(char const*)
[CONTEXT]      https://github.com/facebook/rocksdb/issues/2 internal_repo_rocksdb/repo/env/io_posix.cc:522 rocksdb::PosixRandomAccessFile::MultiRead(rocksdb::ReadRequest*, unsigned long)

I looked at env/io_posix.cc:522 but don't see a reason why the line needs to be there at all, because it is not used before overwritten. So it must be a line that is put there as a bug. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6135

Test Plan: Rerun the same test which passes after the fix. Run all the tests and make sure they all pass.

Differential Revision: D18880251

fbshipit-source-id: 3b84ac6a05b67b529c4202e0ceb4c047460f44f2
2019-12-09 10:25:09 -08:00
sdong
e3a82bb934 PosixRandomAccessFile::MultiRead() to use I/O uring if supported (#5881)
Summary:
Right now, PosixRandomAccessFile::MultiRead() executes read requests in parallel. In this PR, it leverages I/O Uring library to run it in parallel, even when page cache is enabled. This function will fall back if the kernel version doesn't support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5881

Test Plan: Run the unit test on a kernel version supporting it and make sure all tests pass, and run a unit test on kernel version supporting it and see it pass. Before merging, will also run stress test and see it passes.

Differential Revision: D17742266

fbshipit-source-id: e05699c925ac04fdb42379456a4e23e4ebcb803a
2019-12-07 20:55:52 -08:00
Richard He
cfc20019d1 Fixed FALLOC_FL_KEEP_SIZE undefined (#5614)
Summary:
Fix `error: ‘FALLOC_FL_KEEP_SIZE’` undeclared error in `io_posix.cc` during Vagrant build in CentOS as per issue https://github.com/facebook/rocksdb/issues/5599
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5614

Differential Revision: D17217960

fbshipit-source-id: ef736c51b16833107fd9ccc7917ed1def2a8d02c
2019-09-05 17:37:21 -07:00
sheng qiu
c762efc4a9 fix compile error: ‘FALLOC_FL_KEEP_SIZE’ undeclared (#5708)
Summary:
add "linux/falloc.h" in env/io_posix.cc to fix compile error: ‘FALLOC_FL_KEEP_SIZE’ undeclared

Signed-off-by: sheng qiu <herbert1984106@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5708

Differential Revision: D16832922

fbshipit-source-id: 30e787c4a1b5a9724a8acfd68962ff5ec5f27d3e
2019-08-16 13:58:05 -07:00
Yi Wu
849a8c0ae0 fix sign compare warnings (#5651)
Summary:
Fix -Wsign-compare warnings for gcc9.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5651

Test Plan: Tested with ubuntu19.10+gcc9

Differential Revision: D16567428

fbshipit-source-id: 730b2704d42ba0c4e4ea946a3199bbb34be4c25c
2019-07-30 14:12:54 -07:00
ggaurav28
60d8b19836 Implemented a file logger that uses WritableFileWriter (#5491)
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491

Test Plan: make check

Differential Revision: D15909002

Pulled By: ggaurav28

fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
2019-07-09 16:27:22 -07:00
Andrew Kryczka
2c9df9f9e5 Dynamic test whether sync_file_range returns ENOSYS (#5416)
Summary:
`sync_file_range` returns `ENOSYS` on Windows Subsystem for Linux even
when using a supposedly supported filesystem like ext4. To handle this
case we can do a dynamic check that a no-op `sync_file_range`
invocation, which is accomplished by passing zero for the `flags`
argument, succeeds.

Also I rearranged the function and comments to hopefully make it more
easily understandable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5416

Differential Revision: D15807061

fbshipit-source-id: d31d94e1f228b7850ea500e6199f8b5daf8cfbd3
2019-06-13 13:56:10 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Raphael Bost
468ca61105 Break large file writes into 1GB chunks (#5213)
Summary:
This is a workaround for the issue described in #5169.
It has been tested on a database with very large values, but not dedicated test has been added to the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5213

Differential Revision: D15243116

Pulled By: siying

fbshipit-source-id: e0c226a6cd71a60924dcd7ce7af74abcb4054484
2019-05-15 14:20:24 -07:00
Andrew Kryczka
8272a6de57 Optionally wait on bytes_per_sync to smooth I/O (#5183)
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183

Differential Revision: D14953553

Pulled By: siying

fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
2019-04-22 11:51:39 -07:00
Zhongyi Xie
3c5eed5ebe remove incorrect assert in GetUniqueIdFromFile (#5102)
Summary:
User report has shown that sometimes `BlockBasedTable::SetupCacheKeyPrefix` would assert when trying to generate an id from the file. The actual cause seems to be hardware related but we might be better off without the incorrect assertion
See T42178927 for more information
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5102

Differential Revision: D14604677

Pulled By: miasantreble

fbshipit-source-id: fcb09207ebdc4fa66e941afbc0523d84797e7ad7
2019-03-25 23:28:29 -07:00
Rashmi Sharma
a4396f9218 Make it easier for users to load options from option file and set shared block cache. (#5063)
Summary:
[RocksDB] Make it easier for users to load options from option file and set shared block cache.
Right now, it requires several dynamic casting for users to set the shared block cache to their option struct cast from the option file.
If people don't do that, every CF of every DB will generate its own 8MB block cache. It's not a usable setting. So we are dragging every user who loads options from the file into such a mess.
Instead, we should allow them to pass their cache object to LoadLatestOptions() and LoadOptionsFromFile(), so that those loaded option structs will have the shared block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5063

Differential Revision: D14518584

Pulled By: rashmishrm

fbshipit-source-id: c91430ff9425a0e67d76fc67931d755f491ca5aa
2019-03-21 16:25:28 -07:00
Andrew Kryczka
186b3afaa8 Use fallocate even if hole-punching unsupported (#5023)
Summary:
The compiler flag `-DROCKSDB_FALLOCATE_PRESENT` was only set when
`fallocate`, `FALLOC_FL_KEEP_SIZE`, and `FALLOC_FL_PUNCH_HOLE` were all
present. However, the last of the three is not really necessary for the
primary `fallocate` use case; furthermore, it was introduced only in later
Linux kernel versions (2.6.38+).

This PR changes the flag `-DROCKSDB_FALLOCATE_PRESENT` to only require
`fallocate` and `FALLOC_FL_KEEP_SIZE` to be present. There is a separate
check for `FALLOC_FL_PUNCH_HOLE` only in the place where it is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5023

Differential Revision: D14248487

Pulled By: siying

fbshipit-source-id: a10ed0b902fa755988e957bd2dcec9081ec0502e
2019-03-04 15:43:17 -08:00
Young Tack Jin
4091597c67 fix for nvme device path (#4866)
Summary:
nvme device path doesn't have "block" as like "nvme/nvme0/nvme0n1"
or "nvme/nvme0/nvme0n1/nvme0n1p1". the last directory such as
"nvme0n1p1" should be removed if nvme drive is partitioned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4866

Differential Revision: D13627824

Pulled By: riversand963

fbshipit-source-id: 09ab968f349f3dbb890beea20193f1359b17d317
2019-01-31 19:08:37 -08:00
Sagar Vemuri
645e57c22d Assert for Direct IO at the beginning in PositionedRead (#3891)
Summary:
Moved the direct-IO assertion to the top in `PosixSequentialFile::PositionedRead`, as it doesn't make sense to check for sector alignments before checking for direct IO.
Closes https://github.com/facebook/rocksdb/pull/3891

Differential Revision: D8267972

Pulled By: sagar0

fbshipit-source-id: 0ecf77c0fb5c35747a4ddbc15e278918c0849af7
2018-06-21 14:58:01 -07:00
奏之章
6e08916eb3 Fix Fadvise on closed file when reads use mmap
Summary:
```PosixMmapReadableFile::fd_``` is closed after created, but needs to remain open for the lifetime of `PosixMmapReadableFile` since it is used whenever `InvalidateCache` is called.
Closes https://github.com/facebook/rocksdb/pull/2764

Differential Revision: D8152515

Pulled By: ajkr

fbshipit-source-id: b738a6a55ba4e392f9b0f374ff396a1e61c64f65
2018-05-25 10:57:57 -07:00
Dmitri Smirnov
3db8504cde Catchup with posix features
Summary:
Catch up with Posix features
  NewWritableRWFile must fail when file does not exists
  Implement Env::Truncate()
  Adjust Env options optimization functions
  Implement MemoryMappedBuffer on Windows.
Closes https://github.com/facebook/rocksdb/pull/3857

Differential Revision: D8053610

Pulled By: ajkr

fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e
2018-05-24 15:13:04 -07:00
Andrew Kryczka
46152d53bf Second attempt at db_stress crash-recovery verification
Summary:
- Original commit: a4fb1f8c04
- Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db2e

This PR includes the contents of the original commit plus two bug fixes, which are:

- In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
- Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
Closes https://github.com/facebook/rocksdb/pull/3793

Differential Revision: D7811671

Pulled By: ajkr

fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30
2018-04-30 12:27:34 -07:00
Andrew Kryczka
6afe22db2e revert db_stress crash-recovery verification
Summary:
crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures.
Closes https://github.com/facebook/rocksdb/pull/3786

Differential Revision: D7794516

Pulled By: ajkr

fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33
2018-04-27 12:57:01 -07:00
Andrew Kryczka
a4fb1f8c04 Add crash-recovery correctness check to db_stress
Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.

In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.

For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.

`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629

Differential Revision: D7463144

Pulled By: ajkr

fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
2018-04-24 15:58:22 -07:00
Zhongyi Xie
954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Andrew Kryczka
5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak
aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai
f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Tamir Duberstein
cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00