Commit Graph

1790 Commits

Author SHA1 Message Date
sdong
699a569c52 Remove RandomAccessFileReader.for_compaction_ (#5572)
Summary:
RandomAccessFileReader.for_compaction_ doesn't seem to be used anymore. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5572

Test Plan: USE_CLANG=1 make all check -j

Differential Revision: D16286178

fbshipit-source-id: aa338049761033dfbe5e8b1707bbb0be2df5be7e
2019-07-16 16:32:18 -07:00
Yikun Jiang
f064d74e45 Cleanup the Arm64 CRC32 unused warning (#5565)
Summary:
When 'HAVE_ARM64_CRC' is set, the blew methods:

- bool rocksdb::crc32c::isSSE42()
- bool rocksdb::crc32c::isPCLMULQDQ()

are defined but not used, the unused-function is raised
when do rocksdb build.

This patch try to cleanup these warnings by add ifndef,
if it build under the HAVE_ARM64_CRC, we will not define
`isSSE42` and `isPCLMULQDQ`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5565

Differential Revision: D16233654

fbshipit-source-id: c32a9dda7465dbf65f9ccafef159124db92cdffd
2019-07-15 11:20:26 -07:00
ggaurav28
60d8b19836 Implemented a file logger that uses WritableFileWriter (#5491)
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491

Test Plan: make check

Differential Revision: D15909002

Pulled By: ggaurav28

fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
2019-07-09 16:27:22 -07:00
sdong
e4dcf5fd22 db_bench to add a new "benchmark" to print out all stats history (#5532)
Summary:
Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532

Test Plan: Run the benchmark manually and observe the output is as expected.

Differential Revision: D16097764

fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904
2019-07-03 20:03:28 -07:00
Sagar Vemuri
84c5c9aab1 Fix a bug in compaction reads causing checksum mismatches and asan errors (#5531)
Summary:
Fixed a bug in compaction reads due to which incorrect number of bytes were being read/utilized. The bug was introduced in https://github.com/facebook/rocksdb/issues/5498 , resulting in "Corruption: block checksum mismatch" and "heap-buffer-overflow" asan errors in our tests.

https://github.com/facebook/rocksdb/issues/5498 was introduced recently and is not in any released versions.

ASAN:
```
> ==2280939==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6250005e83da at pc 0x000000d57f62 bp 0x7f954f483770 sp 0x7f954f482f20
> === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
> READ of size 4 at 0x6250005e83da thread T4
> SCARINESS: 27 (4-byte-read-heap-buffer-overflow-far-from-bounds)

>      #0 tests+0xd57f61                           __asan_memcpy
>      https://github.com/facebook/rocksdb/issues/1 rocksdb/src/util/coding.h:124            rocksdb::DecodeFixed32(char const*)
>      https://github.com/facebook/rocksdb/issues/2 rocksdb/src/table/block_fetcher.cc:39    rocksdb::BlockFetcher::CheckBlockChecksum()
>      https://github.com/facebook/rocksdb/issues/3 rocksdb/src/table/block_fetcher.cc:99    rocksdb::BlockFetcher::TryGetFromPrefetchBuffer()
>      https://github.com/facebook/rocksdb/issues/4 rocksdb/src/table/block_fetcher.cc:209   rocksdb::BlockFetcher::ReadBlockContents()
>      https://github.com/facebook/rocksdb/issues/5 rocksdb/src/table/block_based/block_based_table_reader.cc:93 rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<...>*, rocksdb::ImmutableCFOptions const&, bool, bool, rocksdb::UncompressionDict
 const&, rocksdb::PersistentCacheOptions const&, unsigned long, unsigned long, rocksdb::MemoryAllocator*, bool)
>      https://github.com/facebook/rocksdb/issues/6 rocksdb/src/table/block_based/block_based_table_reader.cc:2331 rocksdb::BlockBasedTable::RetrieveBlock(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<...>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool) const
>      https://github.com/facebook/rocksdb/issues/7 rocksdb/src/table/block_based/block_based_table_reader.cc:2090 rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<...>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, bool, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::Status, rocksdb::FilePrefetchBuffe
r*, bool) const
>      https://github.com/facebook/rocksdb/issues/8 rocksdb/src/table/block_based/block_based_table_reader.cc:2720 rocksdb::BlockBasedTableIterator<...>::InitDataBlock()
>      https://github.com/facebook/rocksdb/issues/9 rocksdb/src/table/block_based/block_based_table_reader.cc:2607 rocksdb::BlockBasedTableIterator<...>::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/10 rocksdb/src/table/iterator_wrapper.h:83  rocksdb::IteratorWrapperBase<...>::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/11 rocksdb/src/table/merging_iterator.cc:100 rocksdb::MergingIterator::SeekToFirst()
>     https://github.com/facebook/rocksdb/issues/12 rocksdb/compaction/compaction_job.cc:877 rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)
>     https://github.com/facebook/rocksdb/issues/13 rocksdb/compaction/compaction_job.cc:590 rocksdb::CompactionJob::Run()
>     https://github.com/facebook/rocksdb/issues/14 rocksdb/db_impl/db_impl_compaction_flush.cc:2689 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)
>     https://github.com/facebook/rocksdb/issues/15 rocksdb/db_impl/db_impl_compaction_flush.cc:2248 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)
>     https://github.com/facebook/rocksdb/issues/16 rocksdb/db_impl/db_impl_compaction_flush.cc:2024 rocksdb::DBImpl::BGWorkCompaction(void*)
>     https://github.com/facebook/rocksdb/issues/23 rocksdb/src/util/threadpool_imp.cc:266   rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)
>     https://github.com/facebook/rocksdb/issues/24 rocksdb/src/util/threadpool_imp.cc:307   rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5531

Test Plan: Verified that this fixes the fb-internal Logdevice test which caught the issue.

Differential Revision: D16109702

Pulled By: sagar0

fbshipit-source-id: 1fc08549cf7b553e338a133ae11eb9f4d5011914
2019-07-03 19:06:46 -07:00
Eli Pozniansky
f872009237 Fix from some C-style casting (#5524)
Summary:
Fix from some C-style casting in bloom.cc and ./tools/db_bench_tool.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5524

Differential Revision: D16075626

Pulled By: elipoz

fbshipit-source-id: 352948885efb64a7ef865942c75c3c727a914207
2019-07-01 13:05:34 -07:00
anand76
7259e28d91 MultiGet parallel IO (#5464)
Summary:
Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464

Test Plan:
1. make check
2. make asan_check
3. make asan_crash

Differential Revision: D15911771

Pulled By: anand1976

fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
2019-06-30 20:56:04 -07:00
Mike Kolupaev
b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
Sergei Petrunia
e731f44022 C file should not include <cinttypes>, it is a C++ header. (#5499)
Summary:
Include <inttypes.h> instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5499

Differential Revision: D15966937

Pulled By: miasantreble

fbshipit-source-id: 2156c4329b91d26d447de94f1231264d52786350
2019-06-24 16:12:39 -07:00
Vijay Nadimpalli
22028aa9ab Compaction Reads should read no more than compaction_readahead_size bytes, when set! (#5498)
Summary:
As a result of https://github.com/facebook/rocksdb/issues/5431 the compaction_readahead_size given by a user was not used exactly, the reason being the code behind readahead for user-read and compaction-read was unified in the above PR and the behavior for user-read is to read readahead_size+n bytes (see FilePrefetchBuffer::TryReadFromCache method). Before the unification the ReadaheadRandomAccessFileReader used compaction_readahead_size as it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5498

Test Plan:
Ran strace command : strace -e pread64 -f -T -t ./db_compaction_test --gtest_filter=DBCompactionTest.PartialManualCompaction

In the test the compaction_readahead_size was configured to 2MB and verified the pread syscall did indeed request 2MB. Before the change it was requesting more than 2MB.

Strace Output:
strace: Process 3798982 attached
Note: Google Test filter = DBCompactionTest.PartialManualCompaction
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.PartialManualCompaction
strace: Process 3798983 attached
strace: Process 3798984 attached
strace: Process 3798985 attached
strace: Process 3798986 attached
strace: Process 3798987 attached
strace: Process 3798992 attached
[pid 3798987] 12:07:05 +++ exited with 0 +++
strace: Process 3798993 attached
[pid 3798993] 12:07:05 +++ exited with 0 +++
strace: Process 3798994 attached
strace: Process 3799008 attached
strace: Process 3799009 attached
[pid 3799008] 12:07:05 +++ exited with 0 +++
strace: Process 3799010 attached
[pid 3799009] 12:07:05 +++ exited with 0 +++
strace: Process 3799011 attached
[pid 3799010] 12:07:05 +++ exited with 0 +++
[pid 3799011] 12:07:05 +++ exited with 0 +++
strace: Process 3799012 attached
[pid 3799012] 12:07:05 +++ exited with 0 +++
strace: Process 3799013 attached
strace: Process 3799014 attached
[pid 3799013] 12:07:05 +++ exited with 0 +++
strace: Process 3799015 attached
[pid 3799014] 12:07:05 +++ exited with 0 +++
[pid 3799015] 12:07:05 +++ exited with 0 +++
strace: Process 3799016 attached
[pid 3799016] 12:07:05 +++ exited with 0 +++
strace: Process 3799017 attached
[pid 3799017] 12:07:05 +++ exited with 0 +++
strace: Process 3799019 attached
[pid 3799019] 12:07:05 +++ exited with 0 +++
strace: Process 3799020 attached
strace: Process 3799021 attached
[pid 3799020] 12:07:05 +++ exited with 0 +++
[pid 3799021] 12:07:05 +++ exited with 0 +++
strace: Process 3799022 attached
[pid 3799022] 12:07:05 +++ exited with 0 +++
strace: Process 3799023 attached
[pid 3799023] 12:07:05 +++ exited with 0 +++
strace: Process 3799047 attached
strace: Process 3799048 attached
[pid 3799047] 12:07:06 +++ exited with 0 +++
[pid 3799048] 12:07:06 +++ exited with 0 +++
[pid 3798994] 12:07:06 +++ exited with 0 +++
strace: Process 3799052 attached
[pid 3799052] 12:07:06 +++ exited with 0 +++
strace: Process 3799054 attached
strace: Process 3799069 attached
strace: Process 3799070 attached
[pid 3799069] 12:07:06 +++ exited with 0 +++
strace: Process 3799071 attached
[pid 3799070] 12:07:06 +++ exited with 0 +++
[pid 3799071] 12:07:06 +++ exited with 0 +++
strace: Process 3799072 attached
strace: Process 3799073 attached
[pid 3799072] 12:07:06 +++ exited with 0 +++
[pid 3799073] 12:07:06 +++ exited with 0 +++
strace: Process 3799074 attached
[pid 3799074] 12:07:06 +++ exited with 0 +++
strace: Process 3799075 attached
[pid 3799075] 12:07:06 +++ exited with 0 +++
strace: Process 3799076 attached
[pid 3799076] 12:07:06 +++ exited with 0 +++
strace: Process 3799077 attached
[pid 3799077] 12:07:06 +++ exited with 0 +++
strace: Process 3799078 attached
[pid 3799078] 12:07:06 +++ exited with 0 +++
strace: Process 3799079 attached
[pid 3799079] 12:07:06 +++ exited with 0 +++
strace: Process 3799080 attached
[pid 3799080] 12:07:06 +++ exited with 0 +++
strace: Process 3799081 attached
[pid 3799081] 12:07:06 +++ exited with 0 +++
strace: Process 3799082 attached
[pid 3799082] 12:07:06 +++ exited with 0 +++
strace: Process 3799083 attached
[pid 3799083] 12:07:06 +++ exited with 0 +++
strace: Process 3799086 attached
strace: Process 3799087 attached
[pid 3798984] 12:07:06 pread64(9, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000121>
[pid 3798984] 12:07:06 pread64(9, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000106>
[pid 3798984] 12:07:06 pread64(9, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000081>
[pid 3798984] 12:07:06 pread64(9, "\0\v\3foo\2\7\0\0\0\0\0\0\0\270 \0\v\4foo\2\3\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000138>
[pid 3798984] 12:07:06 pread64(11, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000097>
[pid 3798984] 12:07:06 pread64(11, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086>
[pid 3798984] 12:07:06 pread64(11, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000064>
[pid 3798984] 12:07:06 pread64(11, "\0\v\3foo\2\21\0\0\0\0\0\0\0\270 \0\v\4foo\2\r\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000064>
[pid 3798984] 12:07:06 pread64(12, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(12, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000090>
[pid 3798984] 12:07:06 pread64(12, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000059>
[pid 3798984] 12:07:06 pread64(12, "\0\v\3foo\2\33\0\0\0\0\0\0\0\270 \0\v\4foo\2\27\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065>
[pid 3798984] 12:07:06 pread64(13, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000070>
[pid 3798984] 12:07:06 pread64(13, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000059>
[pid 3798984] 12:07:06 pread64(13, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000061>
[pid 3798984] 12:07:06 pread64(13, "\0\v\3foo\2%\0\0\0\0\0\0\0\270 \0\v\4foo\2!\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065>
[pid 3798984] 12:07:06 pread64(14, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000118>
[pid 3798984] 12:07:06 pread64(14, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093>
[pid 3798984] 12:07:06 pread64(14, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000050>
[pid 3798984] 12:07:06 pread64(14, "\0\v\3foo\2/\0\0\0\0\0\0\0\270 \0\v\4foo\2+\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000082>
[pid 3798984] 12:07:06 pread64(15, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(15, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086>
[pid 3798984] 12:07:06 pread64(15, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000091>
[pid 3798984] 12:07:06 pread64(15, "\0\v\3foo\0029\0\0\0\0\0\0\0\270 \0\v\4foo\0025\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000174>
[pid 3798984] 12:07:06 pread64(16, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080>
[pid 3798984] 12:07:06 pread64(16, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093>
[pid 3798984] 12:07:06 pread64(16, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000194>
[pid 3798984] 12:07:06 pread64(16, "\0\v\3foo\2C\0\0\0\0\0\0\0\270 \0\v\4foo\2?\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000086>
[pid 3798984] 12:07:06 pread64(17, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000079>
[pid 3798984] 12:07:06 pread64(17, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000047>
[pid 3798984] 12:07:06 pread64(17, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000045>
[pid 3798984] 12:07:06 pread64(17, "\0\v\3foo\2M\0\0\0\0\0\0\0\270 \0\v\4foo\2I\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000107>
[pid 3798983] 12:07:06 pread64(17, "\0\v\200\10foo\2P\0\0\0\0\0\0)U?MSg_)j(roFn($e"..., 2097152, 0) = 11230 <0.000091>
[pid 3798983] 12:07:06 pread64(17, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(16, "\0\v\200\10foo\2F\0\0\0\0\0\0k[h3%.OPH_^:\\S7T&"..., 2097152, 0) = 11230 <0.000083>
[pid 3798983] 12:07:06 pread64(16, "", 2085922, 11230) = 0 <0.000078>
[pid 3798983] 12:07:06 pread64(15, "\0\v\200\10foo\2<\0\0\0\0\0\0+qToi_c{*S+4:N(:"..., 2097152, 0) = 11230 <0.000095>
[pid 3798983] 12:07:06 pread64(15, "", 2085922, 11230) = 0 <0.000067>
[pid 3798983] 12:07:06 pread64(14, "\0\v\200\10foo\0022\0\0\0\0\0\0%hw%OMa\"}9I609Q!B"..., 2097152, 0) = 11230 <0.000111>
[pid 3798983] 12:07:06 pread64(14, "", 2085922, 11230) = 0 <0.000093>
[pid 3798983] 12:07:06 pread64(13, "\0\v\200\10foo\2(\0\0\0\0\0\0p}Y&mu^DcaSGb2&nP"..., 2097152, 0) = 11230 <0.000128>
[pid 3798983] 12:07:06 pread64(13, "", 2085922, 11230) = 0 <0.000076>
[pid 3798983] 12:07:06 pread64(12, "\0\v\200\10foo\2\36\0\0\0\0\0\0YIyW#]oSs^6VHfB<`"..., 2097152, 0) = 11230 <0.000092>
[pid 3798983] 12:07:06 pread64(12, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(11, "\0\v\200\10foo\2\24\0\0\0\0\0\0mfF8Jel/*Zf :-#s("..., 2097152, 0) = 11230 <0.000088>
[pid 3798983] 12:07:06 pread64(11, "", 2085922, 11230) = 0 <0.000067>
[pid 3798983] 12:07:06 pread64(9, "\0\v\200\10foo\2\n\0\0\0\0\0\0\\X'cjiHX)D,RSj1X!"..., 2097152, 0) = 11230 <0.000115>
[pid 3798983] 12:07:06 pread64(9, "", 2085922, 11230) = 0 <0.000073>
[pid 3798983] 12:07:06 pread64(8, "\1\315\5 \36\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 754) = 53 <0.000098>
[pid 3798983] 12:07:06 pread64(8, "\0\22\3rocksdb.properties;\215\5\0\0\0\0\1\0\0\0"..., 37, 717) = 37 <0.000064>
[pid 3798983] 12:07:06 pread64(8, "\0$\4rocksdb.block.based.table.ind"..., 658, 59) = 658 <0.000074>
[pid 3798983] 12:07:06 pread64(8, "\0\v\2foo\1\0\0\0\0\0\0\0\0\31\0\0\0\0\1\0\0\0\0\212\216\222P", 29, 30) = 29 <0.000064>
[pid 3799086] 12:07:06 +++ exited with 0 +++
[pid 3799087] 12:07:06 +++ exited with 0 +++
[pid 3799054] 12:07:06 +++ exited with 0 +++
strace: Process 3799104 attached
[pid 3799104] 12:07:06 +++ exited with 0 +++
[       OK ] DBCompactionTest.PartialManualCompaction (757 ms)
[----------] 1 test from DBCompactionTest (758 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (759 ms total)
[  PASSED  ] 1 test.
[pid 3798983] 12:07:06 +++ exited with 0 +++
[pid 3798984] 12:07:06 +++ exited with 0 +++
[pid 3798992] 12:07:06 +++ exited with 0 +++
[pid 3798986] 12:07:06 +++ exited with 0 +++
[pid 3798982] 12:07:06 +++ exited with 0 +++
[pid 3798985] 12:07:06 +++ exited with 0 +++
12:07:06 +++ exited with 0 +++

Differential Revision: D15948422

Pulled By: vjnadimpalli

fbshipit-source-id: 9b189d1e8675d290c7784e4b33e5d3b5761d2ac8
2019-06-21 21:31:49 -07:00
Vijay Nadimpalli
24b118ad98 Combine the read-ahead logic for user reads and compaction reads (#5431)
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change  `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431

Test Plan:
make check

Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5

Differential Revision: D15772533

Pulled By: vjnadimpalli

fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
2019-06-19 14:10:46 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Yanqin Jin
340ed4fac7 Add support for timestamp in Get/Put (#5079)
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.

Results are as follows:
```
|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079

Differential Revision: D15132946

Pulled By: riversand963

fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
2019-06-05 23:10:47 -07:00
Siying Dong
5851cb7fdb Move util/trace_replay.* to trace_replay/ (#5376)
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376

Differential Revision: D15550938

Pulled By: siying

fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
2019-06-03 13:25:26 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Yuan Zhou
79edf0a7a8 util: fix log_write_bench (#5335)
Summary:
log_write_bench doesn't compile due to some recent API changes.
This patch fixes the compile by adding the missing params for
OptimizeForLogWrite() and WritableFileWriter().

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5335

Differential Revision: D15588875

Pulled By: miasantreble

fbshipit-source-id: 726ff4dc227733e915c3b796df25bd3ab0b431ac
2019-05-31 17:17:57 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong
cb094e13bb Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370)
Summary:
Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370

Differential Revision: D15570413

Pulled By: siying

fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
2019-05-31 10:50:19 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Yanqin Jin
b5e4ee2e76 Fix a clang analyze error (#5365)
Summary:
The analyzer thinks max_allowed_ space can be 0. In that case, free_space will
be assigned as free_space. It fails to realize that the function call
GetFreeSpace actually sets the free_space variable properly, which is possibly
due to lack of inter-function call analysis.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5365

Differential Revision: D15521790

Pulled By: riversand963

fbshipit-source-id: 839d0a285a1c8773a28a385f0c3be4bb7fbe32cb
2019-05-28 12:19:41 -07:00
Sagar Vemuri
e264eebcd7 Add comments in file_reader_writer.h (#5355)
Summary:
Add file and class level comments in file_reader_writer.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5355

Differential Revision: D15499020

Pulled By: sagar0

fbshipit-source-id: 925b2326885cdb4357e6a139ac65ee5e2ce1d613
2019-05-24 20:31:45 -07:00
Yanqin Jin
bd9f1d2d0f Fix RocksDB auto-recovery from SpaceLimit err (#5334)
Summary:
If RocksDB is configured with a positive max_allowed_space (via sst file manager),
then the sst file manager should use this value instead of total free disk
space to determine whether to clear the background error of space limit
reached.

In DBSSTTest.DBWithMaxSpaceAllowed, we configure a low space limit that is very
likely lower than the free disk space of the test machine. Therefore, once the
test db encounters a Status::SpaceLimit, error handler will call into sst file
manager to start error recovery which may clear the bg error since disk free
space is larger than reserved_disk_buffer_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5334

Differential Revision: D15501622

Pulled By: riversand963

fbshipit-source-id: 58035efc450b062d6b28c78c322005ec3705fb47
2019-05-24 18:38:12 -07:00
Sagar Vemuri
b09c018b4d Add comments to trace_replay.h (#5359)
Summary:
Add file, class, and function level comments in trace_replay.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5359

Differential Revision: D15505318

Pulled By: sagar0

fbshipit-source-id: 181e3d4ea805fd9a33f91b89e123bbd0c1ead2ce
2019-05-24 16:59:54 -07:00
Sagar Vemuri
5d359fc337 Document AlignedBuffer (#5345)
Summary:
Add comments to util/aligned_buffer.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5345

Differential Revision: D15496004

Pulled By: sagar0

fbshipit-source-id: 31bc6f35e88dedd74cff55febe02c9e761304f76
2019-05-24 10:05:40 -07:00
Zhongyi Xie
6a54278b4a add class level comment for RepeatableThread
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5344

Differential Revision: D15485431

Pulled By: miasantreble

fbshipit-source-id: 9c0f6cf0d826743e743012549976705ceb8cc0c4
2019-05-23 17:03:23 -07:00
Sagar Vemuri
dda474399a Remove PATENTS text from a few straggler files (#5326)
Summary:
Remove PATENTS related wording from a few stragglers which still reference the old PATENTS file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5326

Differential Revision: D15423297

Pulled By: sagar0

fbshipit-source-id: 4babcddfc120b7d2fed6eb3898287cf8012bf8ea
2019-05-21 16:22:35 -07:00
Vijay Nadimpalli
931c9df886 Use separate status code for column family drop and db shutdown in progress (#5275)
Summary:
Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event.
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55
Comment on this:
abc4202e47/db/version_set.cc (L2742):L2743
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275

Differential Revision: D15204583

Pulled By: vjnadimpalli

fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a
2019-05-20 10:47:32 -07:00
Zhichao Cao
a13026fb2f Added trace replay fast forward function (#5273)
Summary:
In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273

Differential Revision: D15389232

Pulled By: zhichao-cao

fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806
2019-05-16 20:21:18 -07:00
Maysam Yabandeh
f0e8216197 WritePrepared: Fix deadlock in WriteRecoverableState (#5306)
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306

Differential Revision: D15341458

Pulled By: maysamyabandeh

fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
2019-05-15 13:53:54 -07:00
Maysam Yabandeh
6a40ee5eb1 Refresh snapshot list during long compactions (2nd attempt) (#5278)
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278

Differential Revision: D15203291

Pulled By: maysamyabandeh

fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
2019-05-03 17:30:22 -07:00
Maysam Yabandeh
521d234bda Revert snap_refresh_nanos feature (#5269)
Summary:
Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269

Differential Revision: D15151285

Pulled By: maysamyabandeh

fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
2019-05-01 10:07:30 -07:00
Yuqi Gu
03c7ae24c2 RocksDB CRC32c optimization with ARMv8 Intrinsic (#5221)
Summary:
1. Add Arm linear crc32c implemtation for RocksDB.
2. Arm runtime check for crc32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5221

Differential Revision: D15013685

Pulled By: siying

fbshipit-source-id: 2c2983743d26656d93f212dc7c1a3cf66a1acf12
2019-04-30 10:59:05 -07:00
Maysam Yabandeh
506e8448be Refresh snapshot list during long compactions (#5099)
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099

Differential Revision: D15086710

Pulled By: maysamyabandeh

fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
2019-04-25 18:17:22 -07:00
jsteemann
de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints will very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Siying Dong
0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.

Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00
Adam Simpkins
c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
Zhongyi Xie
26015f3b48 add compression options to table properties (#5081)
Summary:
Since we are planning to use dictionary compression and to use different compression level, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query from information_schema.rocksdb_sst_props to see if all sst files are converted to ZSTD dictionary compressions. Resolves https://github.com/facebook/rocksdb/issues/4992

With this PR, user can query table properties through `GetPropertiesOfAllTables` API and get compression options as std::string:
`window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;`
or table_properties->ToString() will also contain it
`# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081

Differential Revision: D14716692

Pulled By: miasantreble

fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd
2019-04-02 14:52:34 -07:00
Yanqin Jin
09957ded1d Update RepeatableThreadTest with MockTimeEnv (#5107)
Summary:
**This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X.**
`RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies test and production code path for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than (real) current time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing test to hang. The extra 10 seconds is a best-effort approach because there seems no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye for possible flakiness of this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107

Differential Revision: D14680885

Pulled By: riversand963

fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e
2019-03-29 10:08:50 -07:00
anand76
dae3b5545c Smooth the deletion of WAL files (#5116)
Summary:
WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116

Differential Revision: D14669437

Pulled By: anand1976

fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
2019-03-28 15:17:13 -07:00
Yi Wu
d69241586e Fix perf_context.user_key_comparison_count for range scan (#5098)
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.

Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098

Differential Revision: D14603753

Pulled By: siying

fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
2019-03-27 10:34:27 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
anand76
959f86e5f8 Use placement new and delete in autovector (#5080)
Summary:
The stack buffer in rocksdb::autovector is currently defined as an array of elements of the template type. This results in unnecessary construction of those objects, which can be a significant overhead in some cases. This PR changes the type of the stack buf to char* and uses placement new to construct new objects when they are inserted into the autovector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5080

Differential Revision: D14533221

Pulled By: anand1976

fbshipit-source-id: 9378985c7d03f4e1a28951bdd2403c72f10f23d7
2019-03-20 10:42:04 -07:00
Zhongyi Xie
a291f3a1e5 Collect compaction stats by priority and dump to info LOG (#5050)
Summary:
In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump.

```
** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050

Differential Revision: D14408583

Pulled By: miasantreble

fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
2019-03-19 17:28:19 -07:00
Wenjie Yang
36c2a7cfb1 Add an option to filter traces (#5082)
Summary:
Add an option to filter out READ or WRITE operations while tracing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082

Differential Revision: D14515083

Pulled By: mrmiywj

fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610
2019-03-19 14:36:51 -07:00
Shobhit Dayal
b45b1cde3e Feature for sampling and reporting compressibility (#4842)
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.

The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.

The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842

Differential Revision: D13629011

Pulled By: shobhitdayal

fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
2019-03-18 12:15:34 -07:00
Zhongyi Xie
fdc72a5c5d add OptionType kInt32T and kInt64T
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5061

Differential Revision: D14418581

Pulled By: miasantreble

fbshipit-source-id: be7f90e16586666ddd0cce36971e403782ab0892
2019-03-12 13:49:52 -07:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Siying Dong
5e298f865b Add two more StatsLevel (#5027)
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027

Differential Revision: D14252765

Pulled By: siying

fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
2019-02-28 10:27:59 -08:00
Siying Dong
06f378d75e When closing BlobDB, should first wait for all background tasks (#5005)
Summary:
When closing a BlobDB, it only waits for background tasks
to finish as the last thing, but the background task may access
some variables that are destroyed. The fix is to introduce a
shutdown function in the timer queue and call the function as
the first thing when destorying BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5005

Differential Revision: D14170342

Pulled By: siying

fbshipit-source-id: 081e6a2d99b9765d5956cf6cdfc290c07270c233
2019-02-21 17:26:01 -08:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Maysam Yabandeh
0f4244fe00 WritePrepared: Improve stress tests with slow threads (#4974)
Summary:
The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974

Differential Revision: D14143070

Pulled By: maysamyabandeh

fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7
2019-02-19 16:56:49 -08:00
Zhongyi Xie
ed995c6a69 add whole key bloom filter support in memtables (#4985)
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985

Differential Revision: D14118257

Pulled By: miasantreble

fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
2019-02-19 12:15:39 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Yanqin Jin
a69d4deefb Atomic ingest (#4895)
Summary:
Make file ingestion atomic.

 as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or non of the ingestions have any effect
on the state of the db.

Also add unit tests for atomic ingestion.

Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895

Differential Revision: D13718245

Pulled By: riversand963

fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
2019-02-12 19:16:17 -08:00
Sagar Vemuri
eafb09a380 Fix issues found by Clang Analyzer (#4976)
Summary:
Fix issues found by Clang Analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4976

Differential Revision: D14054211

Pulled By: sagar0

fbshipit-source-id: ec2053bae43af3b2ff3425306824c677e3ba70c2
2019-02-12 13:59:44 -08:00
Andrew Kryczka
62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
Maysam Yabandeh
d6b9b3b884 Enhance transaction_test_util with delays (#4970)
Summary:
Enhance ::Insert and ::Verify test functions to add artificial delay between prepare and commit, and take snapshot and reads respectively.  A future PR will make use of these to improve stress tests to test against long-running transactions as well as long-running backup jobs. Also randomly sets set_snapshot to false for inserters to skip setting the snapshot in the initialization phase and let the snapshot be taken later explicitly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4970

Differential Revision: D14031342

Pulled By: maysamyabandeh

fbshipit-source-id: b52b453751f0b25b81b23c48892bc1d152464cab
2019-02-11 16:02:37 -08:00
tang-jianfeng
08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
anand76
d0d484b132 Always delete Blob DB files in the background (#4928)
Summary:
Blob DB files are not tracked by the SFM, so they currently don't get
deleted in the background. Force them to be deleted in background so
rate limiting can be applied
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928

Differential Revision: D13854649

Pulled By: anand1976

fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8
2019-01-29 15:50:03 -08:00
Siying Dong
fc53839bfa Disallow customized hash function in DynamicBloom (#4915)
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915

Differential Revision: D13794452

Pulled By: siying

fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
2019-01-24 10:34:30 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Siying Dong
08b8cea69f Deleting Blob files also goes through SstFileManager (#4904)
Summary:
Right now, deleting blob files is not rate limited, even if SstFileManger is specified.
On the other hand, rate limiting blob deletion is not supported. With this change, Blob file
deletion will go through SstFileManager too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904

Differential Revision: D13772545

Pulled By: siying

fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84
2019-01-22 17:00:29 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Yanqin Jin
e79df377c5 Use chrono::time_point instead of time_t (#4868)
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868

Differential Revision: D13633046

Pulled By: riversand963

fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
2019-01-16 09:51:05 -08:00
Yi Wu
77a8d4d476 Detect if Jemalloc is linked with the binary (#4844)
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844

Differential Revision: D13574934

Pulled By: yiwu-arbug

fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
2019-01-03 16:30:12 -08:00
Burton Li
46e3209e0d Compaction limiter miscs (#4795)
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() doesn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795

Differential Revision: D13542183

Pulled By: siying

fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
2018-12-26 13:59:35 -08:00
Abhishek Madan
abf931afa6 Add compaction logic to RangeDelAggregatorV2 (#4758)
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758

Differential Revision: D13439254

Pulled By: abhimadan

fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
2018-12-17 13:20:51 -08:00
Burton Li
a8b9891f95 Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918

We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.

With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.

ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.

The usage is straight forward:
e.g.:

//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);

// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;

// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;

...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332

Differential Revision: D13226590

Pulled By: siying

fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 13:18:28 -08:00
Siying Dong
ae25546a7a Direct I/O Close() shouldn't rewrite the last block (#4771)
Summary:
In Direct I/O case, WritableFileWriter::Close() rewrites the last block again, even if there is nothing new. The reason is that, Close() flushes the buffer. For non-direct I/O case, the buffer is empty in this case so it is a no-op. However, in direct I/O case, the partial data in the last block is kept in the buffer because it needs to be rewritten for the next write. This piece of data is flushed again. This commit fixes it by skipping this write out if `pending_sync_` flag shows that there isn't new data sync last sync.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4771

Differential Revision: D13420426

Pulled By: siying

fbshipit-source-id: 9d39ec9a215b1425d4ed40d85e0eba1f5daa75c6
2018-12-11 13:55:02 -08:00
Maysam Yabandeh
b878f93c70 Extend Transaction::GetForUpdate with do_validate (#4680)
Summary:
Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. It true it indicates that call is assumed to be after a ::GetForUpdate(do_validate=false).
The Java APIs are accordingly updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680

Differential Revision: D13068508

Pulled By: maysamyabandeh

fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
2018-12-06 17:49:00 -08:00
Zhongyi Xie
b0f3d9b478 fix unused param "options" error in jemalloc_nodump_allocator.cc (#4738)
Summary:
Currently tests are failing on master with the following message:
> util/jemalloc_nodump_allocator.cc:132:8: error: unused parameter ‘options’ [-Werror=unused-parameter]
 Status NewJemallocNodumpAllocator(

This PR attempts to fix the issue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4738

Differential Revision: D13278804

Pulled By: miasantreble

fbshipit-source-id: 64a6204aa685bd85d8b5080655cafef9980fac2f
2018-11-30 12:08:55 -08:00
Yi Wu
cf1df5d3cb JemallocNodumpAllocator: option to limit tcache memory usage (#4736)
Summary:
Add option to limit tcache usage by allocation size. This is to reduce total tcache size in case there are many user threads accessing the allocator and incur non-trivial memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4736

Differential Revision: D13269305

Pulled By: yiwu-arbug

fbshipit-source-id: 95a9b7fc67facd66837c849137e30e137112e19d
2018-11-29 17:33:40 -08:00
Abhishek Madan
8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Zhichao Cao
7125e24619 Add the max trace file size limitation option to Tracing (#4610)
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610

Differential Revision: D12893400

Pulled By: zhichao-cao

fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
2018-11-27 14:27:05 -08:00
Soli
f1837595a3 FIX #3278: Move global const object definitions from .h to .cc (#4691)
Summary:
Summary

We should declare constants in headers and define them in source files.
But this commit is only aimed at compound types.

I don't know if it is necessary to do the same thing to fundamental types.

I used this command to find all of the constant definitions in header files.

`find . -name "*.h" | xargs grep -e "^const .*=.*"`

And here is what I found:

```
./db/version_edit.h:const uint64_t kFileNumberMask = 0x3FFFFFFFFFFFFFFF;
./include/rocksdb/env.h:const size_t kDefaultPageSize = 4 * 1024;
./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = {
./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = {
./include/rocksdb/table.h:const uint32_t kPlainTableVariableLength = 0;
./include/rocksdb/utilities/transaction_db.h:const uint32_t kInitialMaxDeadlocks = 5;
./port/port_posix.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max();
./port/port_posix.h:const int kMaxInt32 = std::numeric_limits<int32_t>::max();
./port/port_posix.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max();
./port/port_posix.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max();
./port/port_posix.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max();
./port/win/port_win.h:const uint32_t kMaxUint32 = UINT32_MAX;
./port/win/port_win.h:const int kMaxInt32 = INT32_MAX;
./port/win/port_win.h:const int64_t kMaxInt64 = INT64_MAX;
./port/win/port_win.h:const uint64_t kMaxUint64 = UINT64_MAX;
./port/win/port_win.h:const size_t kMaxSizet = UINT64_MAX;
./port/win/port_win.h:const size_t kMaxSizet = UINT_MAX;
./port/win/port_win.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max();
./port/win/port_win.h:const int kMaxInt32 = std::numeric_limits<int>::max();
./port/win/port_win.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max();
./port/win/port_win.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max();
./port/win/port_win.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max();
./port/win/port_win.h:const bool kLittleEndian = true;
./table/cuckoo_table_factory.h:const uint32_t kCuckooMurmurSeedMultiplier = 816922183;
./table/data_block_hash_index.h:const uint8_t kNoEntry = 255;
./table/data_block_hash_index.h:const uint8_t kCollision = 254;
./table/data_block_hash_index.h:const uint8_t kMaxRestartSupportedByHashIndex = 253;
./table/data_block_hash_index.h:const size_t kMaxBlockSizeSupportedByHashIndex = 1u << 16;
./table/data_block_hash_index.h:const double kDefaultUtilRatio = 0.75;
./table/filter_block.h:const uint64_t kNotValid = ULLONG_MAX;
./table/format.h:const int kMagicNumberLengthByte = 8;
./third-party/fbson/FbsonJsonParser.h:const char* const kJsonDelim = " ,]}\t\r\n";
./third-party/fbson/FbsonJsonParser.h:const char* const kWhiteSpace = " \t\n\r";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const BiggestInt kMaxBiggestInt =
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestStyleFlag[] = "death_test_style";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestUseFork[] = "death_test_use_fork";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kInternalRunDeathTestFlag[] = "internal_run_death_test";
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char* pets[] = {"cat", "dog"};
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const size_t kProtobufOneLinerMaxLength = 50;
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const int kMaxStackTraceDepth = 100;
./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const T* WithParamInterface<T>::parameter_ = NULL;
./util/coding.h:const unsigned int kMaxVarint64Length = 10;
./util/filename.h:const size_t kFormatFileNumberBufSize = 38;
./util/testutil.h:const SliceTransform* RandomSliceTransform(Random* rnd, int pre_defined = -1);
./util/trace_replay.h:const std::string kTraceMagic = "feedcafedeadbeef";
./util/trace_replay.h:const unsigned int kTraceTimestampSize = 8;
./util/trace_replay.h:const unsigned int kTraceTypeSize = 1;
./util/trace_replay.h:const unsigned int kTracePayloadLengthSize = 4;
./util/trace_replay.h:const unsigned int kTraceMetadataSize =
./utilities/cassandra/serialize.h:const int64_t kCharMask = 0xFFLL;
./utilities/cassandra/serialize.h:const int32_t kBitsPerByte = 8;
```

And these 3 lines are related to this commit:

```
./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = {
./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = {
./util/trace_replay.h:const std::string kTraceMagic = "feedcafedeadbeef";
```

Any comments would be appreciated.
Thanks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4691

Differential Revision: D13208049

Pulled By: ajkr

fbshipit-source-id: e5ee55fdaec5447fc5798c6721e2821e7cdc0d5b
2018-11-26 21:32:03 -08:00
Abhishek Madan
85394a96ca Speed up range scans with range tombstones (#4677)
Summary:
Previously, every range tombstone iterator was seeked on every
ShouldDelete call, which quickly degraded performance for long range
scans. This PR improves performance by tracking iterator positions and
only advancing iterators when necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677

Differential Revision: D13205373

Pulled By: abhimadan

fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb
2018-11-26 16:33:41 -08:00
Yi Wu
327097c952 JemallocAllocator: thread-local tcache (#4603)
Summary:
Add option to support  thread-local tcache to reduce mutex contention inside Jemalloc arena.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4603

Differential Revision: D12830738

Pulled By: yiwu-arbug

fbshipit-source-id: 59bd25b165b903f23a6a8531b18d72e140d69f65
2018-11-19 22:39:08 -08:00
Simon Liu
a2de8e52bb optimized the performance of autovector::emplace_back. (#4606)
Summary:
It called the autovector::push_back simply in autovector::emplace_back.
This was not efficient, and then optimazed this function through the
perfect forwarding.

This was the src and result of the benchmark(using the google'benchmark library, the type of elem in
autovector was std::string, and call emplace_back with the "char *" type):

https://gist.github.com/monadbobo/93448b89a42737b08cbada81de75c5cd

PS: The benchmark's result of  previous PR was not accurate, and so I update the test case and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4606

Differential Revision: D13046813

Pulled By: sagar0

fbshipit-source-id: 19cde1bcadafe899aa454b703acb35737a1cc02d
2018-11-13 14:39:03 -08:00
Andrew Kryczka
ea9454700a Backup engine support for direct I/O reads (#4640)
Summary:
Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640

Differential Revision: D13015268

Pulled By: ajkr

fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
2018-11-13 11:17:25 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Andrew Gallagher
0148f717ff Move #include outside of namespace (#4629)
Summary:
clang modules warns about `#include`s inside of namespaces.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4629

Reviewed By: ajkr

Differential Revision: D12927333

Pulled By: andrewjcg

fbshipit-source-id: a9e0b069e63d8224f78b7c3be1c3acf09bb83d3f
2018-11-06 17:18:28 -08:00
Soli
3f8f81cfeb FIX #3820: shorter file name in logs (#4616)
Summary:
Long absolute file names in log make it hard to read the LOG files.
So we shorter them to relative to the root of RocksDB project path.
In most cases, they will only have one level directory and one file name.

There was [a talk](#4316) about making "util/logging.h" a public header file.
But we concern the conflicts that might be introduced in for macros
named `STRINGIFY`, `TOSTRING`, and `PREPEND_FILE_LINE`.

So I prepend a prefix `ROCKS_LOG_` to them.
I also remove the line that includes "port.h" which seems unneccessary here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4616

Differential Revision: D12892857

Pulled By: siying

fbshipit-source-id: af79aaf82153b8fd66b5966aced39a51fbca9c6c
2018-11-01 16:19:01 -07:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Yi Wu
5f5fddabc7 port folly::JemallocNodumpAllocator (#4534)
Summary:
Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
1. It only support jemalloc >= 5.0
2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.

Depending on #4502.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534

Differential Revision: D10435474

Pulled By: yiwu-arbug

fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
2018-10-26 17:29:18 -07:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Neil Mayhew
43dbd4411e Adapt three unit tests with newer compiler/libraries (#4562)
Summary:
This fixes three tests that fail with relatively recent tools and libraries:

The tests are:

* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`

I'm using:

* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1

The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.

However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:

```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```

I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562

Differential Revision: D10510917

Pulled By: maysamyabandeh

fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
2018-10-24 08:17:56 -07:00
Yi Wu
c7a45ca91f BlobDB: handle IO error on write (#4580)
Summary:
A fix similar to #4410 but on the write path. On IO error on `SelectBlobFile()` we didn't return error code properly, but simply a nullptr of `BlobFile`. The `AppendBlob()` method didn't have null check for the pointer and caused crash. The fix make sure we properly return error code in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4580

Differential Revision: D10513849

Pulled By: yiwu-arbug

fbshipit-source-id: 80bca920d1d7a3541149de981015ad83e0aa14b5
2018-10-23 15:03:45 -07:00
jsteemann
d1c0d3f358 Small issues (#4564)
Summary:
Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564

Differential Revision: D10510183

Pulled By: maysamyabandeh

fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b
2018-10-23 10:35:57 -07:00
Yi Wu
933250e355 Fix RepeatableThreadTest::MockEnvTest hang (#4560)
Summary:
When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560

Differential Revision: D10472851

Pulled By: yiwu-arbug

fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
2018-10-21 20:17:18 -07:00
Yanqin Jin
da4aa59b4c Add read retry support to log reader (#4394)
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.

Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394

Differential Revision: D9926508

Pulled By: riversand963

fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
2018-10-19 11:53:00 -07:00
Yanqin Jin
ce52274640 Replace 'string' with 'const string&' in FileOperationInfo (#4491)
Summary:
Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying  on #3933.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491

Differential Revision: D10381211

Pulled By: riversand963

fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31
2018-10-15 13:46:01 -07:00