Commit Graph

69 Commits

Author SHA1 Message Date
Sagar Vemuri
84c5c9aab1 Fix a bug in compaction reads causing checksum mismatches and asan errors (#5531)
Summary:
Fixed a bug in compaction reads due to which incorrect number of bytes were being read/utilized. The bug was introduced in https://github.com/facebook/rocksdb/issues/5498 , resulting in "Corruption: block checksum mismatch" and "heap-buffer-overflow" asan errors in our tests.

https://github.com/facebook/rocksdb/issues/5498 was introduced recently and is not in any released versions.

ASAN:
```
> ==2280939==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6250005e83da at pc 0x000000d57f62 bp 0x7f954f483770 sp 0x7f954f482f20
> === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
> READ of size 4 at 0x6250005e83da thread T4
> SCARINESS: 27 (4-byte-read-heap-buffer-overflow-far-from-bounds)

>      #0 tests+0xd57f61                           __asan_memcpy
>      https://github.com/facebook/rocksdb/issues/1 rocksdb/src/util/coding.h:124            rocksdb::DecodeFixed32(char const*)
>      https://github.com/facebook/rocksdb/issues/2 rocksdb/src/table/block_fetcher.cc:39    rocksdb::BlockFetcher::CheckBlockChecksum()
>      https://github.com/facebook/rocksdb/issues/3 rocksdb/src/table/block_fetcher.cc:99    rocksdb::BlockFetcher::TryGetFromPrefetchBuffer()
>      https://github.com/facebook/rocksdb/issues/4 rocksdb/src/table/block_fetcher.cc:209   rocksdb::BlockFetcher::ReadBlockContents()
>      https://github.com/facebook/rocksdb/issues/5 rocksdb/src/table/block_based/block_based_table_reader.cc:93 rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<...>*, rocksdb::ImmutableCFOptions const&, bool, bool, rocksdb::UncompressionDict
 const&, rocksdb::PersistentCacheOptions const&, unsigned long, unsigned long, rocksdb::MemoryAllocator*, bool)
>      https://github.com/facebook/rocksdb/issues/6 rocksdb/src/table/block_based/block_based_table_reader.cc:2331 rocksdb::BlockBasedTable::RetrieveBlock(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<...>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool) const
>      https://github.com/facebook/rocksdb/issues/7 rocksdb/src/table/block_based/block_based_table_reader.cc:2090 rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<...>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, bool, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::Status, rocksdb::FilePrefetchBuffe
r*, bool) const
>      https://github.com/facebook/rocksdb/issues/8 rocksdb/src/table/block_based/block_based_table_reader.cc:2720 rocksdb::BlockBasedTableIterator<...>::InitDataBlock()
>      https://github.com/facebook/rocksdb/issues/9 rocksdb/src/table/block_based/block_based_table_reader.cc:2607 rocksdb::BlockBasedTableIterator<...>::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/10 rocksdb/src/table/iterator_wrapper.h:83  rocksdb::IteratorWrapperBase<...>::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/11 rocksdb/src/table/merging_iterator.cc:100 rocksdb::MergingIterator::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/12 rocksdb/compaction/compaction_job.cc:877 rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)
>     https://github.com/facebook/rocksdb/issues/13 rocksdb/compaction/compaction_job.cc:590 rocksdb::CompactionJob::Run()
>     https://github.com/facebook/rocksdb/issues/14 rocksdb/db_impl/db_impl_compaction_flush.cc:2689 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)
>     https://github.com/facebook/rocksdb/issues/15 rocksdb/db_impl/db_impl_compaction_flush.cc:2248 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)
>     https://github.com/facebook/rocksdb/issues/16 rocksdb/db_impl/db_impl_compaction_flush.cc:2024 rocksdb::DBImpl::BGWorkCompaction(void*)
>     https://github.com/facebook/rocksdb/issues/23 rocksdb/src/util/threadpool_imp.cc:266   rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)
>     https://github.com/facebook/rocksdb/issues/24 rocksdb/src/util/threadpool_imp.cc:307   rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5531

Test Plan: Verified that this fixes the fb-internal Logdevice test which caught the issue.

Differential Revision: D16109702

Pulled By: sagar0

fbshipit-source-id: 1fc08549cf7b553e338a133ae11eb9f4d5011914
2019-07-03 19:06:46 -07:00
anand76
7259e28d91 MultiGet parallel IO (#5464)
Summary:
Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464

Test Plan:
1. make check
2. make asan_check
3. make asan_crash

Differential Revision: D15911771

Pulled By: anand1976

fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
2019-06-30 20:56:04 -07:00
Vijay Nadimpalli
22028aa9ab Compaction Reads should read no more than compaction_readahead_size bytes, when set! (#5498)
Summary:
As a result of https://github.com/facebook/rocksdb/issues/5431 the compaction_readahead_size given by a user was not used exactly, the reason being the code behind readahead for user-read and compaction-read was unified in the above PR and the behavior for user-read is to read readahead_size+n bytes (see FilePrefetchBuffer::TryReadFromCache method). Before the unification the ReadaheadRandomAccessFileReader used compaction_readahead_size as it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5498

Test Plan:
Ran strace command : strace -e pread64 -f -T -t ./db_compaction_test --gtest_filter=DBCompactionTest.PartialManualCompaction

In the test the compaction_readahead_size was configured to 2MB and verified the pread syscall did indeed request 2MB. Before the change it was requesting more than 2MB.

Strace Output:
strace: Process 3798982 attached
Note: Google Test filter = DBCompactionTest.PartialManualCompaction
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.PartialManualCompaction
strace: Process 3798983 attached
strace: Process 3798984 attached
strace: Process 3798985 attached
strace: Process 3798986 attached
strace: Process 3798987 attached
strace: Process 3798992 attached
[pid 3798987] 12:07:05 +++ exited with 0 +++
strace: Process 3798993 attached
[pid 3798993] 12:07:05 +++ exited with 0 +++
strace: Process 3798994 attached
strace: Process 3799008 attached
strace: Process 3799009 attached
[pid 3799008] 12:07:05 +++ exited with 0 +++
strace: Process 3799010 attached
[pid 3799009] 12:07:05 +++ exited with 0 +++
strace: Process 3799011 attached
[pid 3799010] 12:07:05 +++ exited with 0 +++
[pid 3799011] 12:07:05 +++ exited with 0 +++
strace: Process 3799012 attached
[pid 3799012] 12:07:05 +++ exited with 0 +++
strace: Process 3799013 attached
strace: Process 3799014 attached
[pid 3799013] 12:07:05 +++ exited with 0 +++
strace: Process 3799015 attached
[pid 3799014] 12:07:05 +++ exited with 0 +++
[pid 3799015] 12:07:05 +++ exited with 0 +++
strace: Process 3799016 attached
[pid 3799016] 12:07:05 +++ exited with 0 +++
strace: Process 3799017 attached
[pid 3799017] 12:07:05 +++ exited with 0 +++
strace: Process 3799019 attached
[pid 3799019] 12:07:05 +++ exited with 0 +++
strace: Process 3799020 attached
strace: Process 3799021 attached
[pid 3799020] 12:07:05 +++ exited with 0 +++
[pid 3799021] 12:07:05 +++ exited with 0 +++
strace: Process 3799022 attached
[pid 3799022] 12:07:05 +++ exited with 0 +++
strace: Process 3799023 attached
[pid 3799023] 12:07:05 +++ exited with 0 +++
strace: Process 3799047 attached
strace: Process 3799048 attached
[pid 3799047] 12:07:06 +++ exited with 0 +++
[pid 3799048] 12:07:06 +++ exited with 0 +++
[pid 3798994] 12:07:06 +++ exited with 0 +++
strace: Process 3799052 attached
[pid 3799052] 12:07:06 +++ exited with 0 +++
strace: Process 3799054 attached
strace: Process 3799069 attached
strace: Process 3799070 attached
[pid 3799069] 12:07:06 +++ exited with 0 +++
strace: Process 3799071 attached
[pid 3799070] 12:07:06 +++ exited with 0 +++
[pid 3799071] 12:07:06 +++ exited with 0 +++
strace: Process 3799072 attached
strace: Process 3799073 attached
[pid 3799072] 12:07:06 +++ exited with 0 +++
[pid 3799073] 12:07:06 +++ exited with 0 +++
strace: Process 3799074 attached
[pid 3799074] 12:07:06 +++ exited with 0 +++
strace: Process 3799075 attached
[pid 3799075] 12:07:06 +++ exited with 0 +++
strace: Process 3799076 attached
[pid 3799076] 12:07:06 +++ exited with 0 +++
strace: Process 3799077 attached
[pid 3799077] 12:07:06 +++ exited with 0 +++
strace: Process 3799078 attached
[pid 3799078] 12:07:06 +++ exited with 0 +++
strace: Process 3799079 attached
[pid 3799079] 12:07:06 +++ exited with 0 +++
strace: Process 3799080 attached
[pid 3799080] 12:07:06 +++ exited with 0 +++
strace: Process 3799081 attached
[pid 3799081] 12:07:06 +++ exited with 0 +++
strace: Process 3799082 attached
[pid 3799082] 12:07:06 +++ exited with 0 +++
strace: Process 3799083 attached
[pid 3799083] 12:07:06 +++ exited with 0 +++
strace: Process 3799086 attached
strace: Process 3799087 attached
[pid 3798984] 12:07:06 pread64(9, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000121>
[pid 3798984] 12:07:06 pread64(9, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000106>
[pid 3798984] 12:07:06 pread64(9, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000081>
[pid 3798984] 12:07:06 pread64(9, "\0\v\3foo\2\7\0\0\0\0\0\0\0\270 \0\v\4foo\2\3\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000138>
[pid 3798984] 12:07:06 pread64(11, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000097>
[pid 3798984] 12:07:06 pread64(11, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086>
[pid 3798984] 12:07:06 pread64(11, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000064>
[pid 3798984] 12:07:06 pread64(11, "\0\v\3foo\2\21\0\0\0\0\0\0\0\270 \0\v\4foo\2\r\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000064>
[pid 3798984] 12:07:06 pread64(12, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(12, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000090>
[pid 3798984] 12:07:06 pread64(12, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000059>
[pid 3798984] 12:07:06 pread64(12, "\0\v\3foo\2\33\0\0\0\0\0\0\0\270 \0\v\4foo\2\27\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065>
[pid 3798984] 12:07:06 pread64(13, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000070>
[pid 3798984] 12:07:06 pread64(13, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000059>
[pid 3798984] 12:07:06 pread64(13, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000061>
[pid 3798984] 12:07:06 pread64(13, "\0\v\3foo\2%\0\0\0\0\0\0\0\270 \0\v\4foo\2!\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065>
[pid 3798984] 12:07:06 pread64(14, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000118>
[pid 3798984] 12:07:06 pread64(14, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093>
[pid 3798984] 12:07:06 pread64(14, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000050>
[pid 3798984] 12:07:06 pread64(14, "\0\v\3foo\2/\0\0\0\0\0\0\0\270 \0\v\4foo\2+\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000082>
[pid 3798984] 12:07:06 pread64(15, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(15, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086>
[pid 3798984] 12:07:06 pread64(15, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000091>
[pid 3798984] 12:07:06 pread64(15, "\0\v\3foo\0029\0\0\0\0\0\0\0\270 \0\v\4foo\0025\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000174>
[pid 3798984] 12:07:06 pread64(16, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(16, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093>
[pid 3798984] 12:07:06 pread64(16, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000194>
[pid 3798984] 12:07:06 pread64(16, "\0\v\3foo\2C\0\0\0\0\0\0\0\270 \0\v\4foo\2?\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000086>
[pid 3798984] 12:07:06 pread64(17, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000079>
[pid 3798984] 12:07:06 pread64(17, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000047>
[pid 3798984] 12:07:06 pread64(17, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000045>
[pid 3798984] 12:07:06 pread64(17, "\0\v\3foo\2M\0\0\0\0\0\0\0\270 \0\v\4foo\2I\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000107>
[pid 3798983] 12:07:06 pread64(17, "\0\v\200\10foo\2P\0\0\0\0\0\0)U?MSg_)j(roFn($e"..., 2097152, 0) = 11230 <0.000091>
[pid 3798983] 12:07:06 pread64(17, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(16, "\0\v\200\10foo\2F\0\0\0\0\0\0k[h3%.OPH_^:\\S7T&"..., 2097152, 0) = 11230 <0.000083>
[pid 3798983] 12:07:06 pread64(16, "", 2085922, 11230) = 0 <0.000078>
[pid 3798983] 12:07:06 pread64(15, "\0\v\200\10foo\2<\0\0\0\0\0\0+qToi_c{*S+4:N(:"..., 2097152, 0) = 11230 <0.000095>
[pid 3798983] 12:07:06 pread64(15, "", 2085922, 11230) = 0 <0.000067>
[pid 3798983] 12:07:06 pread64(14, "\0\v\200\10foo\0022\0\0\0\0\0\0%hw%OMa\"}9I609Q!B"..., 2097152, 0) = 11230 <0.000111>
[pid 3798983] 12:07:06 pread64(14, "", 2085922, 11230) = 0 <0.000093>
[pid 3798983] 12:07:06 pread64(13, "\0\v\200\10foo\2(\0\0\0\0\0\0p}Y&mu^DcaSGb2&nP"..., 2097152, 0) = 11230 <0.000128>
[pid 3798983] 12:07:06 pread64(13, "", 2085922, 11230) = 0 <0.000076>
[pid 3798983] 12:07:06 pread64(12, "\0\v\200\10foo\2\36\0\0\0\0\0\0YIyW#]oSs^6VHfB<`"..., 2097152, 0) = 11230 <0.000092>
[pid 3798983] 12:07:06 pread64(12, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(11, "\0\v\200\10foo\2\24\0\0\0\0\0\0mfF8Jel/*Zf :-#s("..., 2097152, 0) = 11230 <0.000088>
[pid 3798983] 12:07:06 pread64(11, "", 2085922, 11230) = 0 <0.000067>
[pid 3798983] 12:07:06 pread64(9, "\0\v\200\10foo\2\n\0\0\0\0\0\0\\X'cjiHX)D,RSj1X!"..., 2097152, 0) = 11230 <0.000115>
[pid 3798983] 12:07:06 pread64(9, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(8, "\1\315\5 \36\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 754) = 53 <0.000098>
[pid 3798983] 12:07:06 pread64(8, "\0\22\3rocksdb.properties;\215\5\0\0\0\0\1\0\0\0"..., 37, 717) = 37 <0.000064>
[pid 3798983] 12:07:06 pread64(8, "\0$\4rocksdb.block.based.table.ind"..., 658, 59) = 658 <0.000074>
[pid 3798983] 12:07:06 pread64(8, "\0\v\2foo\1\0\0\0\0\0\0\0\0\31\0\0\0\0\1\0\0\0\0\212\216\222P", 29, 30) = 29 <0.000064>
[pid 3799086] 12:07:06 +++ exited with 0 +++
[pid 3799087] 12:07:06 +++ exited with 0 +++
[pid 3799054] 12:07:06 +++ exited with 0 +++
strace: Process 3799104 attached
[pid 3799104] 12:07:06 +++ exited with 0 +++
[       OK ] DBCompactionTest.PartialManualCompaction (757 ms)
[----------] 1 test from DBCompactionTest (758 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (759 ms total)
[  PASSED  ] 1 test.
[pid 3798983] 12:07:06 +++ exited with 0 +++
[pid 3798984] 12:07:06 +++ exited with 0 +++
[pid 3798992] 12:07:06 +++ exited with 0 +++
[pid 3798986] 12:07:06 +++ exited with 0 +++
[pid 3798982] 12:07:06 +++ exited with 0 +++
[pid 3798985] 12:07:06 +++ exited with 0 +++
12:07:06 +++ exited with 0 +++

Differential Revision: D15948422

Pulled By: vjnadimpalli

fbshipit-source-id: 9b189d1e8675d290c7784e4b33e5d3b5761d2ac8
2019-06-21 21:31:49 -07:00
Vijay Nadimpalli
24b118ad98 Combine the read-ahead logic for user reads and compaction reads (#5431)
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change  `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431

Test Plan:
make check

Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5

Differential Revision: D15772533

Pulled By: vjnadimpalli

fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
2019-06-19 14:10:46 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Siying Dong
ae25546a7a Direct I/O Close() shouldn't rewrite the last block (#4771)
Summary:
In Direct I/O case, WritableFileWriter::Close() rewrites the last block again, even if there is nothing new. The reason is that, Close() flushes the buffer. For non-direct I/O case, the buffer is empty in this case so it is a no-op. However, in direct I/O case, the partial data in the last block is kept in the buffer because it needs to be rewritten for the next write. This piece of data is flushed again. This commit fixes it by skipping this write out if `pending_sync_` flag shows that there isn't new data sync last sync.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4771

Differential Revision: D13420426

Pulled By: siying

fbshipit-source-id: 9d39ec9a215b1425d4ed40d85e0eba1f5daa75c6
2018-12-11 13:55:02 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Yanqin Jin
729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Zhichao Cao
999d955e4f RocksDB Trace Analyzer (#4091)
Summary:
A framework of trace analyzing for RocksDB

After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file

**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
    intervals

**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range

**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace

**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get

**Future Work:**
1.  Support execution time analyzing of each requests
2.  Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091

Differential Revision: D9256157

Pulled By: zhichao-cao

fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
2018-08-13 11:44:02 -07:00
Siying Dong
8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Zhongyi Xie
91d7c03cdc Exclude time waiting for rate limiter from rocksdb.sst.read.micros (#4102)
Summary:
Our "rocksdb.sst.read.micros" stat includes time spent waiting for rate limiter. It probably only affects people rate limiting compaction reads, which is fairly rare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4102

Differential Revision: D8848506

Pulled By: miasantreble

fbshipit-source-id: 01258ac5ae56e4eee372978cfc9143a6869f8bfc
2018-07-13 18:44:14 -07:00
Sagar Vemuri
1c912196de Remove external tracking of AlignedBuffer's size (#4105)
Summary:
Remove external tracking of AlignedBuffer's size in `ReadaheadRandomAccessFile` and `FilePrefetchBuffer`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4105

Differential Revision: D8805724

Pulled By: sagar0

fbshipit-source-id: d61d8c203c7c500e3f36e912132d7852026ed023
2018-07-11 15:57:49 -07:00
Sagar Vemuri
440621aab8 Fix Copying of data between buffers in FilePrefetchBuffer (#4100)
Summary:
Copy data between buffers inside FilePrefetchBuffer only when chunk length is greater than 0. Otherwise AlignedBuffer was accessing memory out of its range causing crashes.

Removing the tracking of buffer length outside of `AlignedBuffer`, i.e. in `FilePrefetchBuffer` and `ReadaheadRandomAccessFile`, will follow in a separate PR, as it is not the root cause of the crash reported in #4051. (`FilePrefetchBuffer` itself has been this way from its inception, and `ReadaheadRandomAccessFile` was updated to add the buffer length at some point).

Comprehensive tests for `FilePrefetchBuffer` also to follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4100

Differential Revision: D8792590

Pulled By: sagar0

fbshipit-source-id: 3578f45761cf6884243e767f749db4016ccc93e1
2018-07-11 12:28:13 -07:00
Sagar Vemuri
7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00
Andrew Kryczka
072ae671a7 Apply use_direct_io_for_flush_and_compaction to writes only
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
2018-05-09 19:42:58 -07:00
Andrew Kryczka
4bf169f07e Disable readahead when using mmap for reads
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the  `mmap`'d memory region. This PR:

- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813

Differential Revision: D7891513

Pulled By: ajkr

fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
2018-05-08 12:13:18 -07:00
Anand Ananthabhotla
f9f4d40f93 Align SST file data blocks to avoid spanning multiple pages
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502

Differential Revision: D7400897

Pulled By: anand1976

fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
2018-03-26 20:26:10 -07:00
Fosco Marotto
d518fe1da6 uint64_t and size_t changes to compile for iOS
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t.  This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.

This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.

Also up for discussion: is iOS worth supporting?  Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503

Differential Revision: D7106457

Pulled By: gfosco

fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
2018-03-06 12:43:51 -08:00
Siying Dong
b3c5351335 Direct I/O writable file should do fsync in Close()
Summary:
We don't do fsync() after truncate in direct I/O writeable file (in fact we don't do any fsync ever). This can cause metadata not persistent to disk after the file is generated. We call it instead.
Closes https://github.com/facebook/rocksdb/pull/3500

Differential Revision: D6981482

Pulled By: siying

fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef
2018-02-13 16:27:11 -08:00
Andrew Kryczka
b78ed0460b fix ReadaheadRandomAccessFile/iterator prefetch bug
Summary:
`ReadaheadRandomAccessFile` is used by iterators for file reads in several cases, like in compaction when `compaction_readahead_size > 0` or `use_direct_io_for_flush_and_compaction == true`, or in user iterator when `ReadOptions::readahead_size > 0`. `ReadaheadRandomAccessFile` maintains an internal buffer for readahead data. It assumes that, if the buffer's length is less than `ReadaheadRandomAccessFile::readahead_size_`, which is fixed in the constructor, then EOF has been reached so it doesn't try reading further.

Recently, d938226af4 started calling `RandomAccessFile::Prefetch` with various lengths: 8KB, 16KB, etc. When the `RandomAccessFile` is a `ReadaheadRandomAccessFile`, it triggers the above condition and incorrectly determines EOF. If a block is partially in the readahead buffer and EOF is incorrectly decided, the result is a truncated data block.

The problem is reproducible:

```
TEST_TMPDIR=/data/compaction_bench ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -block_size=18384 -use_direct_io_for_flush_and_compaction=true
...
put error: Corruption: truncated block read from /data/compaction_bench/dbbench/000014.sst offset 20245, expected 10143 bytes, got 8427
```
Closes https://github.com/facebook/rocksdb/pull/3454

Differential Revision: D6869405

Pulled By: ajkr

fbshipit-source-id: 87001c299e7600a37c0dcccbd0368e0954c929cf
2018-02-01 09:42:09 -08:00
Maysam Yabandeh
4927b4e662 Rounddown in FilePrefetchBuffer::Prefetch
Summary:
FilePrefetchBuffer::Prefetch is currently rounds the offset up which does not fit its new use cases in prefetching index/filter blocks, as it would skips over some the offsets that were requested to be prefetched. This patch rounds down instead.

Fixes #3180
Closes https://github.com/facebook/rocksdb/pull/3413

Differential Revision: D6816392

Pulled By: maysamyabandeh

fbshipit-source-id: 3aaeaf59c55d72b61dacfae6d4a8e65eccb3c553
2018-01-26 12:57:25 -08:00
Siying Dong
666a005f9b Support prefetch last 512KB with direct I/O in block based file reader
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708

Differential Revision: D5593091

Pulled By: siying

fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
2017-08-11 12:16:45 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Andrew Kryczka
c217e0b9c7 Call RateLimiter for compaction reads
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433

Differential Revision: D5216946

Pulled By: ajkr

fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
2017-06-13 14:56:46 -07:00
Siying Dong
0175d58c3c Make direct I/O write use incremental buffer
Summary:
Currently for direct I/O, the large maximum buffer is always allocated. This will be wasteful if users flush the data in much smaller chunks. This diff fix this by changing the behavior of incremental buffer works. When we enlarge buffer, we try to copy the existing data in the buffer to the enlarged buffer, rather than flush the buffer first. This can make sure that no extra I/O is introduced because of buffer enlargement.
Closes https://github.com/facebook/rocksdb/pull/2403

Differential Revision: D5178403

Pulled By: siying

fbshipit-source-id: a8fe1e7304bdb8cab2973340022fe80ff83449fd
2017-06-13 04:41:37 -07:00
Aaron Gao
a36220ccfb fix unity test
Summary:
unity test will fail even if we have the same function names in different anonymous namespaces in different files.
Closes https://github.com/facebook/rocksdb/pull/2321

Differential Revision: D5083783

Pulled By: lightmark

fbshipit-source-id: 1347aaf866900af30d23cdd4f29c1b96f17352af
2017-05-17 18:56:55 -07:00
Aaron Gao
2d42cf5ea9 Roundup read bytes in ReadaheadRandomAccessFile
Summary:
Fix alignment in ReadaheadRandomAccessFile
Closes https://github.com/facebook/rocksdb/pull/2253

Differential Revision: D5012336

Pulled By: lightmark

fbshipit-source-id: 10d2c829520cb787227ef653ef63d5d701725778
2017-05-05 12:14:14 -07:00
Siying Dong
d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Dmitri Smirnov
cdad04b051 Remove double buffering on RandomRead on Windows.
Summary:
Remove double buffering on RandomRead on Windows.
  With more logic appear in file reader/write Read no longer
  obeys forwarding calls to Windows implementation.
  Previously direct_io (unbuffered) was only available on Windows
  but now is supported as generic.
  We remove intermediate buffering on Windows.
  Remove random_access_max_buffer_size option which was windows specific.
  Non-zero values for that opton introduced unnecessary lock contention.
  Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
  no longer necessary.
  Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105

Differential Revision: D4847770

Pulled By: siying

fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
2017-04-27 12:30:05 -07:00
Aaron Gao
6e8d6f429d readahead backwards from sst end
Summary:
prefetch some data from the end of the file for each compaction to reduce IO.
Closes https://github.com/facebook/rocksdb/pull/2149

Differential Revision: D4880576

Pulled By: lightmark

fbshipit-source-id: aa767cd1afc84c541837fbf1ad6c0d45b34d3932
2017-04-14 18:56:14 -07:00
Aaron Gao
10d7546961 set readahead buffer size from roundup(user_size) + 4k to roundup(use…
Summary:
Users usually set readahead buffer to a multiple of 4k, more than that, usually a multiple of blocks.
So previously we set real buffer size 512 * n + 4k, which may introduce an additional block reading.
Closes https://github.com/facebook/rocksdb/pull/2138

Differential Revision: D4871504

Pulled By: lightmark

fbshipit-source-id: b070faa51d92e976e8e8468c00692699e585e243
2017-04-11 17:13:33 -07:00
Siying Dong
d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Aaron Gao
9272e12f19 avoid ftruncate twice in buffered io
Summary:
in buffered io, the filesize_ is the real size.
Closes https://github.com/facebook/rocksdb/pull/1991

Differential Revision: D4711433

Pulled By: lightmark

fbshipit-source-id: ad604b9
2017-03-17 11:39:13 -07:00
Aaron Gao
d525718a93 cleanup direct io flag in WritableFileWriter
Summary:
remove unnecessary field `direct_io_`, use `use_direct_io()` instead.
Closes https://github.com/facebook/rocksdb/pull/1992

Differential Revision: D4712195

Pulled By: lightmark

fbshipit-source-id: 57d34f9
2017-03-14 22:39:09 -07:00
Andrew Kryczka
7c80a6d7d1 Statistic for how often rate limiter is drained
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.

The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946

Differential Revision: D4646489

Pulled By: ajkr

fbshipit-source-id: d8e0161
2017-03-02 17:54:15 -08:00
Aaron Gao
f206af56fc add use_direct_io() to ReadaheadRandomAccessFile
Summary:
Missing this function will cause RandomAccessFileReader not doing alignment in Direct IO mode, which introduce an IOError: invalid argument.
Closes https://github.com/facebook/rocksdb/pull/1900

Differential Revision: D4601261

Pulled By: lightmark

fbshipit-source-id: c3eadf1
2017-02-22 14:54:11 -08:00
Yulia Kartseva
ebc8a79980 alignment is on in ReadaheadRandomAccessFile::Read()
Summary: Closes https://github.com/facebook/rocksdb/pull/1857

Differential Revision: D4534518

Pulled By: wat-ze-hex

fbshipit-source-id: b456946
2017-02-18 12:09:12 -08:00
Aaron Gao
db2b4eb50e avoid direct io in rocksdb_lite
Summary:
fix lite bugs
disable direct io in lite mode
Closes https://github.com/facebook/rocksdb/pull/1870

Differential Revision: D4559866

Pulled By: yiwu-arbug

fbshipit-source-id: 3761c51
2017-02-16 10:39:13 -08:00
Aaron Gao
3e6899d116 change UseDirectIO() to use_direct_io()
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770

Differential Revision: D4416435

Pulled By: lightmark

fbshipit-source-id: 4143c53
2017-01-13 12:09:15 -08:00
Aaron Gao
dc2584eea0 direct reads refactor
Summary:
direct IO reads refactoring
remove unnecessary classes and unified interfaces
tested with db_bench

need more change for options and ON/OFF for different files.
Since disabled is default, it should be fine now
Closes https://github.com/facebook/rocksdb/pull/1636

Differential Revision: D4307189

Pulled By: lightmark

fbshipit-source-id: 6991e22
2017-01-11 16:54:12 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Dhruba Borthakur
34723b4c4a Cleanup unused variable pending_fsync_.
Summary: Cleanup unused variable pending_fsync_.

Test Plan: make check

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61581
2016-08-05 10:31:41 -07:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Dmitri Smirnov
5270b33bd3 Make use of portable uint64_t type to make possible file access
in 64-bit.

  Currently, a signed off_t type is being used for the following
  interfaces for both offset and the length in bytes:
  * `Allocate`
  * `RangeSync`

  On Linux `off_t` is automatically either 32 or 64-bit depending on
  the platform. On Windows it is always a 32-bit signed long which
  limits file access and in particular space pre-allocation
  to effectively 2 Gb.

  Proposal is to replace off_t with uint64_t as a portable type
  always access files with 64-bit interfaces.

  May need to modify posix code but lack resources to test it.
2015-11-10 17:03:42 -08:00
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00