rocksdb

Author	SHA1	Message	Date
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). 
But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
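The deferred-loading idea in the commit above can be illustrated with a small, self-contained sketch. This is not RocksDB code: `LazyFileCursor` and `ScanWithDeferredLoads` are hypothetical names, each "file" holds a single in-memory block, and the sketch assumes the first key stored in the index equals the first key of that block. It only shows why a merging heap ordered by index-provided first keys lets a short scan skip the block reads for files that never reach the top.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one sst file after Seek(): the index supplies the
// first key of the target block, and the block itself is "read" lazily.
struct LazyFileCursor {
  LazyFileCursor(std::string k, std::vector<std::string> b)
      : first_key(std::move(k)), block(std::move(b)) {}

  std::string first_key;           // known from the index, no block read needed
  std::vector<std::string> block;  // simulated data block (sorted keys)
  bool loaded = false;             // whether the block has been "read"
  std::size_t pos = 0;             // position inside the block once loaded

  // Before the block is loaded, the cursor is ordered by the index's first key.
  const std::string& key() const { return loaded ? block[pos] : first_key; }
};

// Returns all keys < upper_bound in sorted order and counts simulated reads.
std::vector<std::string> ScanWithDeferredLoads(
    std::vector<LazyFileCursor>& files, const std::string& upper_bound,
    int* blocks_read) {
  auto greater = [](const LazyFileCursor* a, const LazyFileCursor* b) {
    return a->key() > b->key();
  };
  // Min-heap over the cursors, smallest current key on top.
  std::priority_queue<LazyFileCursor*, std::vector<LazyFileCursor*>,
                      decltype(greater)>
      heap(greater);
  for (auto& f : files) {
    if (!f.block.empty()) heap.push(&f);
  }

  std::vector<std::string> out;
  while (!heap.empty()) {
    LazyFileCursor* top = heap.top();
    heap.pop();
    if (top->key() >= upper_bound) break;  // every remaining key is larger
    if (!top->loaded) {                    // read the block only at this point
      top->loaded = true;
      ++*blocks_read;
    }
    out.push_back(top->block[top->pos]);
    if (++top->pos < top->block.size()) heap.push(top);
  }
  return out;
}
```

With three such files whose first keys are `log1/0001`, `log2/0001`, and `log3/0001`, a scan bounded by `log2` returns only the `log1` keys and loads a single block; the other two cursors are compared using their index keys alone.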
Sergei Petrunia	e731f44022	C file should not include <cinttypes>, it is a C++ header. (#5499 ) Summary: Include <inttypes.h> instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5499 Differential Revision: D15966937 Pulled By: miasantreble fbshipit-source-id: 2156c4329b91d26d447de94f1231264d52786350	2019-06-24 16:12:39 -07:00
Vijay Nadimpalli	22028aa9ab	Compaction Reads should read no more than compaction_readahead_size bytes, when set! (#5498 ) Summary: As a result of https://github.com/facebook/rocksdb/issues/5431 the compaction_readahead_size given by a user was not used exactly, the reason being the code behind readahead for user-read and compaction-read was unified in the above PR and the behavior for user-read is to read readahead_size+n bytes (see FilePrefetchBuffer::TryReadFromCache method). Before the unification the ReadaheadRandomAccessFileReader used compaction_readahead_size as it is. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5498 Test Plan: Ran strace command : strace -e pread64 -f -T -t ./db_compaction_test --gtest_filter=DBCompactionTest.PartialManualCompaction In the test the compaction_readahead_size was configured to 2MB and verified the pread syscall did indeed request 2MB. Before the change it was requesting more than 2MB. Strace Output: strace: Process 3798982 attached Note: Google Test filter = DBCompactionTest.PartialManualCompaction [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. 
[----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.PartialManualCompaction strace: Process 3798983 attached strace: Process 3798984 attached strace: Process 3798985 attached strace: Process 3798986 attached strace: Process 3798987 attached strace: Process 3798992 attached [pid 3798987] 12:07:05 +++ exited with 0 +++ strace: Process 3798993 attached [pid 3798993] 12:07:05 +++ exited with 0 +++ strace: Process 3798994 attached strace: Process 3799008 attached strace: Process 3799009 attached [pid 3799008] 12:07:05 +++ exited with 0 +++ strace: Process 3799010 attached [pid 3799009] 12:07:05 +++ exited with 0 +++ strace: Process 3799011 attached [pid 3799010] 12:07:05 +++ exited with 0 +++ [pid 3799011] 12:07:05 +++ exited with 0 +++ strace: Process 3799012 attached [pid 3799012] 12:07:05 +++ exited with 0 +++ strace: Process 3799013 attached strace: Process 3799014 attached [pid 3799013] 12:07:05 +++ exited with 0 +++ strace: Process 3799015 attached [pid 3799014] 12:07:05 +++ exited with 0 +++ [pid 3799015] 12:07:05 +++ exited with 0 +++ strace: Process 3799016 attached [pid 3799016] 12:07:05 +++ exited with 0 +++ strace: Process 3799017 attached [pid 3799017] 12:07:05 +++ exited with 0 +++ strace: Process 3799019 attached [pid 3799019] 12:07:05 +++ exited with 0 +++ strace: Process 3799020 attached strace: Process 3799021 attached [pid 3799020] 12:07:05 +++ exited with 0 +++ [pid 3799021] 12:07:05 +++ exited with 0 +++ strace: Process 3799022 attached [pid 3799022] 12:07:05 +++ exited with 0 +++ strace: Process 3799023 attached [pid 3799023] 12:07:05 +++ exited with 0 +++ strace: Process 3799047 attached strace: Process 3799048 attached [pid 3799047] 12:07:06 +++ exited with 0 +++ [pid 3799048] 12:07:06 +++ exited with 0 +++ [pid 3798994] 12:07:06 +++ exited with 0 +++ strace: Process 3799052 attached [pid 3799052] 12:07:06 +++ exited with 0 +++ strace: Process 3799054 attached strace: Process 3799069 attached strace: Process 3799070 attached [pid 
3799069] 12:07:06 +++ exited with 0 +++ strace: Process 3799071 attached [pid 3799070] 12:07:06 +++ exited with 0 +++ [pid 3799071] 12:07:06 +++ exited with 0 +++ strace: Process 3799072 attached strace: Process 3799073 attached [pid 3799072] 12:07:06 +++ exited with 0 +++ [pid 3799073] 12:07:06 +++ exited with 0 +++ strace: Process 3799074 attached [pid 3799074] 12:07:06 +++ exited with 0 +++ strace: Process 3799075 attached [pid 3799075] 12:07:06 +++ exited with 0 +++ strace: Process 3799076 attached [pid 3799076] 12:07:06 +++ exited with 0 +++ strace: Process 3799077 attached [pid 3799077] 12:07:06 +++ exited with 0 +++ strace: Process 3799078 attached [pid 3799078] 12:07:06 +++ exited with 0 +++ strace: Process 3799079 attached [pid 3799079] 12:07:06 +++ exited with 0 +++ strace: Process 3799080 attached [pid 3799080] 12:07:06 +++ exited with 0 +++ strace: Process 3799081 attached [pid 3799081] 12:07:06 +++ exited with 0 +++ strace: Process 3799082 attached [pid 3799082] 12:07:06 +++ exited with 0 +++ strace: Process 3799083 attached [pid 3799083] 12:07:06 +++ exited with 0 +++ strace: Process 3799086 attached strace: Process 3799087 attached [pid 3798984] 12:07:06 pread64(9, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000121> [pid 3798984] 12:07:06 pread64(9, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000106> [pid 3798984] 12:07:06 pread64(9, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000081> [pid 3798984] 12:07:06 pread64(9, "\0\v\3foo\2\7\0\0\0\0\0\0\0\270 \0\v\4foo\2\3\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000138> [pid 3798984] 12:07:06 pread64(11, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000097> [pid 3798984] 12:07:06 pread64(11, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086> [pid 3798984] 12:07:06 pread64(11, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 
<0.000064> [pid 3798984] 12:07:06 pread64(11, "\0\v\3foo\2\21\0\0\0\0\0\0\0\270 \0\v\4foo\2\r\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000064> [pid 3798984] 12:07:06 pread64(12, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(12, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000090> [pid 3798984] 12:07:06 pread64(12, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000059> [pid 3798984] 12:07:06 pread64(12, "\0\v\3foo\2\33\0\0\0\0\0\0\0\270 \0\v\4foo\2\27\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065> [pid 3798984] 12:07:06 pread64(13, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000070> [pid 3798984] 12:07:06 pread64(13, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000059> [pid 3798984] 12:07:06 pread64(13, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000061> [pid 3798984] 12:07:06 pread64(13, "\0\v\3foo\2%\0\0\0\0\0\0\0\270 \0\v\4foo\2!\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065> [pid 3798984] 12:07:06 pread64(14, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000118> [pid 3798984] 12:07:06 pread64(14, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093> [pid 3798984] 12:07:06 pread64(14, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000050> [pid 3798984] 12:07:06 pread64(14, "\0\v\3foo\2/\0\0\0\0\0\0\0\270 \0\v\4foo\2+\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000082> [pid 3798984] 12:07:06 pread64(15, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(15, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086> [pid 3798984] 12:07:06 pread64(15, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000091> [pid 3798984] 12:07:06 pread64(15, 
"\0\v\3foo\0029\0\0\0\0\0\0\0\270 \0\v\4foo\0025\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000174> [pid 3798984] 12:07:06 pread64(16, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(16, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093> [pid 3798984] 12:07:06 pread64(16, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000194> [pid 3798984] 12:07:06 pread64(16, "\0\v\3foo\2C\0\0\0\0\0\0\0\270 \0\v\4foo\2?\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000086> [pid 3798984] 12:07:06 pread64(17, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000079> [pid 3798984] 12:07:06 pread64(17, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000047> [pid 3798984] 12:07:06 pread64(17, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000045> [pid 3798984] 12:07:06 pread64(17, "\0\v\3foo\2M\0\0\0\0\0\0\0\270 \0\v\4foo\2I\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000107> [pid 3798983] 12:07:06 pread64(17, "\0\v\200\10foo\2P\0\0\0\0\0\0)U?MSg_)j(roFn($e"..., 2097152, 0) = 11230 <0.000091> [pid 3798983] 12:07:06 pread64(17, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(16, "\0\v\200\10foo\2F\0\0\0\0\0\0k[h3%.OPH_^:\\S7T&"..., 2097152, 0) = 11230 <0.000083> [pid 3798983] 12:07:06 pread64(16, "", 2085922, 11230) = 0 <0.000078> [pid 3798983] 12:07:06 pread64(15, "\0\v\200\10foo\2<\0\0\0\0\0\0+qToi_c{S+4:N(:"..., 2097152, 0) = 11230 <0.000095> [pid 3798983] 12:07:06 pread64(15, "", 2085922, 11230) = 0 <0.000067> [pid 3798983] 12:07:06 pread64(14, "\0\v\200\10foo\0022\0\0\0\0\0\0%hw%OMa\"}9I609Q!B"..., 2097152, 0) = 11230 <0.000111> [pid 3798983] 12:07:06 pread64(14, "", 2085922, 11230) = 0 <0.000093> [pid 3798983] 12:07:06 pread64(13, "\0\v\200\10foo\2(\0\0\0\0\0\0p}Y&mu^DcaSGb2&nP"..., 2097152, 0) = 11230 <0.000128> [pid 3798983] 12:07:06 pread64(13, "", 2085922, 11230) 
= 0 <0.000076> [pid 3798983] 12:07:06 pread64(12, "\0\v\200\10foo\2\36\0\0\0\0\0\0YIyW#]oSs^6VHfB<`"..., 2097152, 0) = 11230 <0.000092> [pid 3798983] 12:07:06 pread64(12, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(11, "\0\v\200\10foo\2\24\0\0\0\0\0\0mfF8Jel/Zf :-#s("..., 2097152, 0) = 11230 <0.000088> [pid 3798983] 12:07:06 pread64(11, "", 2085922, 11230) = 0 <0.000067> [pid 3798983] 12:07:06 pread64(9, "\0\v\200\10foo\2\n\0\0\0\0\0\0\\X'cjiHX)D,RSj1X!"..., 2097152, 0) = 11230 <0.000115> [pid 3798983] 12:07:06 pread64(9, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(8, "\1\315\5 \36\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 754) = 53 <0.000098> [pid 3798983] 12:07:06 pread64(8, "\0\22\3rocksdb.properties;\215\5\0\0\0\0\1\0\0\0"..., 37, 717) = 37 <0.000064> [pid 3798983] 12:07:06 pread64(8, "\0$\4rocksdb.block.based.table.ind"..., 658, 59) = 658 <0.000074> [pid 3798983] 12:07:06 pread64(8, "\0\v\2foo\1\0\0\0\0\0\0\0\0\31\0\0\0\0\1\0\0\0\0\212\216\222P", 29, 30) = 29 <0.000064> [pid 3799086] 12:07:06 +++ exited with 0 +++ [pid 3799087] 12:07:06 +++ exited with 0 +++ [pid 3799054] 12:07:06 +++ exited with 0 +++ strace: Process 3799104 attached [pid 3799104] 12:07:06 +++ exited with 0 +++ [ OK ] DBCompactionTest.PartialManualCompaction (757 ms) [----------] 1 test from DBCompactionTest (758 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (759 ms total) [ PASSED ] 1 test. [pid 3798983] 12:07:06 +++ exited with 0 +++ [pid 3798984] 12:07:06 +++ exited with 0 +++ [pid 3798992] 12:07:06 +++ exited with 0 +++ [pid 3798986] 12:07:06 +++ exited with 0 +++ [pid 3798982] 12:07:06 +++ exited with 0 +++ [pid 3798985] 12:07:06 +++ exited with 0 +++ 12:07:06 +++ exited with 0 +++ Differential Revision: D15948422 Pulled By: vjnadimpalli fbshipit-source-id: 9b189d1e8675d290c7784e4b33e5d3b5761d2ac8	2019-06-21 21:31:49 -07:00
Vijay Nadimpalli	24b118ad98	Combine the read-ahead logic for user reads and compaction reads (#5431 ) Summary: Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431 Test Plan: make check Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5 Differential Revision: D15772533 Pulled By: vjnadimpalli fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9	2019-06-19 14:10:46 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
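As a minimal illustration of the simplified include described above, the following sketch formats a 64-bit value with `PRIu64` using only `<cinttypes>`. The helper name `FormatU64` is made up for this example and does not come from the RocksDB code base.

```cpp
#include <cinttypes>  // PRIu64; no __STDC_FORMAT_MACROS needed with C++11
#include <cstdint>
#include <cstdio>
#include <string>

// Formats an unsigned 64-bit value portably; PRIu64 expands to the correct
// printf length modifier for std::uint64_t on the current platform.
std::string FormatU64(std::uint64_t value) {
  char buf[32];  // 20 digits plus the terminator fit comfortably
  std::snprintf(buf, sizeof(buf), "%" PRIu64, value);
  return std::string(buf);
}
```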
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
Siying Dong	5851cb7fdb	Move util/trace_replay.* to trace_replay/ (#5376 ) Summary: util/ is meant for lower-level libraries. trace_replay is tightly integrated with DB and sometimes calls into DB, so move it out to a separate directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376 Differential Revision: D15550938 Pulled By: siying fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207	2019-06-03 13:25:26 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Yuan Zhou	79edf0a7a8	util: fix log_write_bench (#5335 ) Summary: log_write_bench doesn't compile due to some recent API changes. This patch fixes the compile by adding the missing params for OptimizeForLogWrite() and WritableFileWriter(). Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5335 Differential Revision: D15588875 Pulled By: miasantreble fbshipit-source-id: 726ff4dc227733e915c3b796df25bd3ab0b431ac	2019-05-31 17:17:57 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Siying Dong	cb094e13bb	Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370 ) Summary: Right now, with the auto roll logger, options.keep_log_file_num is enforced only when events such as a DB reopen or a full obsolete-file scan happen. In the meantime, the size and number of log files can grow without limit. This change enforces the option more strictly, so that the number of log files always stays under control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370 Differential Revision: D15570413 Pulled By: siying fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e	2019-05-31 10:50:19 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong there or are only loosely related. Move them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ is meant for lower-level libraries, so it's a good idea to move out the files that require knowledge of DB. Create a file/ directory and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
Yanqin Jin	b5e4ee2e76	Fix a clang analyze error (#5365 ) Summary: The analyzer thinks max_allowed_space can be 0. In that case, free_space will be assigned as free_space. It fails to realize that the function call GetFreeSpace actually sets the free_space variable properly, which is possibly due to lack of inter-function call analysis. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5365 Differential Revision: D15521790 Pulled By: riversand963 fbshipit-source-id: 839d0a285a1c8773a28a385f0c3be4bb7fbe32cb	2019-05-28 12:19:41 -07:00
Sagar Vemuri	e264eebcd7	Add comments in file_reader_writer.h (#5355 ) Summary: Add file and class level comments in file_reader_writer.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5355 Differential Revision: D15499020 Pulled By: sagar0 fbshipit-source-id: 925b2326885cdb4357e6a139ac65ee5e2ce1d613	2019-05-24 20:31:45 -07:00
Yanqin Jin	bd9f1d2d0f	Fix RocksDB auto-recovery from SpaceLimit err (#5334 ) Summary: If RocksDB is configured with a positive max_allowed_space (via sst file manager), then the sst file manager should use this value instead of total free disk space to determine whether to clear the background error of space limit reached. In DBSSTTest.DBWithMaxSpaceAllowed, we configure a low space limit that is very likely lower than the free disk space of the test machine. Therefore, once the test db encounters a Status::SpaceLimit, error handler will call into sst file manager to start error recovery which may clear the bg error since disk free space is larger than reserved_disk_buffer_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5334 Differential Revision: D15501622 Pulled By: riversand963 fbshipit-source-id: 58035efc450b062d6b28c78c322005ec3705fb47	2019-05-24 18:38:12 -07:00
Sagar Vemuri	b09c018b4d	Add comments to trace_replay.h (#5359 ) Summary: Add file, class, and function level comments in trace_replay.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5359 Differential Revision: D15505318 Pulled By: sagar0 fbshipit-source-id: 181e3d4ea805fd9a33f91b89e123bbd0c1ead2ce	2019-05-24 16:59:54 -07:00
Sagar Vemuri	5d359fc337	Document AlignedBuffer (#5345 ) Summary: Add comments to util/aligned_buffer.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5345 Differential Revision: D15496004 Pulled By: sagar0 fbshipit-source-id: 31bc6f35e88dedd74cff55febe02c9e761304f76	2019-05-24 10:05:40 -07:00
Zhongyi Xie	6a54278b4a	add class level comment for RepeatableThread Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5344 Differential Revision: D15485431 Pulled By: miasantreble fbshipit-source-id: 9c0f6cf0d826743e743012549976705ceb8cc0c4	2019-05-23 17:03:23 -07:00
Sagar Vemuri	dda474399a	Remove PATENTS text from a few straggler files (#5326 ) Summary: Remove PATENTS related wording from a few stragglers which still reference the old PATENTS file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5326 Differential Revision: D15423297 Pulled By: sagar0 fbshipit-source-id: 4babcddfc120b7d2fed6eb3898287cf8012bf8ea	2019-05-21 16:22:35 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction is sharing the same snapshot list, ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Yuqi Gu	03c7ae24c2	RocksDB CRC32c optimization with ARMv8 Intrinsic (#5221 ) Summary: 1. Add Arm linear crc32c implementation for RocksDB. 2. Arm runtime check for crc32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5221 Differential Revision: D15013685 Pulled By: siying fbshipit-source-id: 2c2983743d26656d93f212dc7c1a3cf66a1acf12	2019-04-30 10:59:05 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
jsteemann	de76909464	refactor SavePoints (#5192 ) Summary: Savepoints are assumed to be used in a stack-wise fashion (only the top element should be used), so they were stored by `WriteBatch` in a member variable `save_points` using an std::stack. Conceptually this is fine, but the implementation had a few issues: - the `save_points_` instance variable was a plain pointer to a heap- allocated `SavePoints` struct. The destructor of `WriteBatch` simply deletes this pointer. However, the copy constructor of WriteBatch just copied that pointer, meaning that copying a WriteBatch with active savepoints will very likely have crashed before. Now a proper copy of the savepoints is made in the copy constructor, and not just a copy of the pointer - `save_points_` was an std::stack, which defaults to `std::deque` for the underlying container. A deque is a bit over the top here, as we only need access to the most recent savepoint (i.e. stack.top()) but never any elements at the front. std::deque is rather expensive to initialize in common environments. For example, the STL implementation shipped with GNU g++ will perform a heap allocation of more than 500 bytes to create an empty deque object. Although the `save_points_` container is created lazily by RocksDB, moving from a deque to a plain `std::vector` is much more memory-efficient. So `save_points_` is now a vector. - `save_points_` was changed from a plain pointer to an `std::unique_ptr`, making ownership more explicit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192 Differential Revision: D15024074 Pulled By: maysamyabandeh fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7	2019-04-19 20:33:04 -07:00
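The ownership fix described in the commit above can be sketched in isolation. `BatchSketch` below is a hypothetical stand-in, not the real `WriteBatch`, and storing savepoints as plain offsets is an assumption made for brevity. It shows the three ingredients the message names: a lazily created container, a `std::stack` backed by `std::vector` instead of the default `std::deque`, and a `std::unique_ptr` whose contents are deep-copied in the copy constructor.

```cpp
#include <cstddef>
#include <memory>
#include <stack>
#include <vector>

// Hypothetical stand-in for a batch that tracks savepoints as byte offsets.
class BatchSketch {
 public:
  BatchSketch() = default;

  // Deep copy: the new object gets its own stack, so destroying or modifying
  // one object never touches the other's savepoints.
  BatchSketch(const BatchSketch& other) {
    if (other.save_points_) {
      save_points_.reset(new SavePointStack(*other.save_points_));
    }
  }

  BatchSketch& operator=(const BatchSketch& other) {
    if (this != &other) {
      save_points_.reset(
          other.save_points_ ? new SavePointStack(*other.save_points_)
                             : nullptr);
    }
    return *this;
  }

  void SetSavePoint(std::size_t offset) {
    if (!save_points_) save_points_.reset(new SavePointStack());  // lazy
    save_points_->push(offset);
  }

  // Removes the most recent savepoint; returns false if there is none.
  bool PopSavePoint() {
    if (!save_points_ || save_points_->empty()) return false;
    save_points_->pop();
    return true;
  }

  std::size_t SavePointCount() const {
    return save_points_ ? save_points_->size() : 0;
  }

 private:
  // Only the top of the stack is ever needed, so a vector-backed stack avoids
  // the up-front allocation an empty std::deque may perform.
  using SavePointStack = std::stack<std::size_t, std::vector<std::size_t>>;
  std::unique_ptr<SavePointStack> save_points_;
};
```

Copying a `BatchSketch` that has active savepoints and then popping from the copy leaves the original's savepoints untouched, which is the behavior a raw shared pointer could not provide.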
Fosco Marotto	6c2bf9e916	Add copyright headers per FB open-source checkup tool. (#5199 ) Summary: internal task: T35568575 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199 Differential Revision: D14962794 Pulled By: gfosco fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe	2019-04-18 10:55:01 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). 
Batch Sizes 1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Sagar Vemuri	d3d20dcdca	Periodic Compactions (#5166 ) Summary: Introducing Periodic Compactions. This feature allows all the files in a CF to be periodically compacted. It could help in proactively catching any corruptions that could creep into the DB, as every file is constantly getting re-compacted. And also, of course, it helps to clean up data older than a certain threshold. - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF. - This works across all levels. - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used). - Compaction filters, if any, are invoked as usual. - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS). This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on levels 0 through the last but one, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time`, all files in levels up to the last but one keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166 Differential Revision: D14884441 Pulled By: sagar0 fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47	2019-04-10 19:31:18 -07:00
Siying Dong	0bb555630f	Consolidate hash function used for non-persistent data in a new function (#5155 ) Summary: Create new functions NPHash64() and GetSliceNPHash64(), which are currently implemented using murmurhash. Replace the current direct calls of murmurhash() with the new functions if the hash results are not used in on-disk format. This will make it easier to try out or switch to alternative functions in the uses where data format compatibility doesn't need to be considered. This part shouldn't have any performance impact. Also, the sharded cache hash function is changed to the new format, because it falls into this category. It doesn't show visible performance impact in db_bench results. CPU shown by perf is increased from about 0.2% to 0.4% in an extreme benchmark setting (4KB blocks, no-compression, everything cached in block cache). We've known that the current hash function used, our own Hash(), has a serious hash quality problem. It can generate a lot of conflicts with similar input. In this use case, it means extra lock contention for reads from the same file. This slight CPU regression is worth it to me to counter the potential bad performance with hot keys. And hopefully this will get further improved in the future with a better hash function. cache_test's condition is relaxed a little bit too. The new hash is slightly more skewed in this use case, but I manually checked the data and see the hash results are still in a reasonable range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155 Differential Revision: D14834821 Pulled By: siying fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5	2019-04-08 13:32:06 -07:00
Adam Simpkins	c06c4c01c5	Fix many bugs in log statement arguments (#5089 ) Summary: Annotate all of the logging functions to inform the compiler that these use printf-style formatting arguments. This allows the compiler to emit warnings if the format arguments are incorrect. This also fixes many problems reported now that format string checking is enabled. Many of these are simply mix-ups in the argument type (e.g., int vs uint64_t), but in several cases the wrong number of arguments was being passed in, which can cause the code to crash. The primary motivation for this was to fix the log message in `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s format parameter with no argument supplied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089 Differential Revision: D14574795 Pulled By: simpkins fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a	2019-04-04 12:12:11 -07:00
Zhongyi Xie	26015f3b48	add compression options to table properties (#5081 ) Summary: Since we are planning to use dictionary compression and to use different compression levels, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query information_schema.rocksdb_sst_props to see if all sst files are converted to ZSTD dictionary compression. Resolves https://github.com/facebook/rocksdb/issues/4992 With this PR, users can query table properties through the `GetPropertiesOfAllTables` API and get compression options as std::string: `window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;` or table_properties->ToString() will also contain it `# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081 Differential Revision: D14716692 Pulled By: miasantreble fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd	2019-04-02 14:52:34 -07:00
Yanqin Jin	09957ded1d	Update RepeatableThreadTest with MockTimeEnv (#5107 ) Summary: This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X. `RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies test and production code path for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than (real) current time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing the test to hang. The extra 10 seconds is a best-effort approach because there seems to be no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye out for possible flakiness of this test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107 Differential Revision: D14680885 Pulled By: riversand963 fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e	2019-03-29 10:08:50 -07:00
anand76	dae3b5545c	Smooth the deletion of WAL files (#5116 ) Summary: WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116 Differential Revision: D14669437 Pulled By: anand1976 fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0	2019-03-28 15:17:13 -07:00
Yi Wu	d69241586e	Fix perf_context.user_key_comparison_count for range scan (#5098 ) Summary: Currently `perf_context.user_key_comparison_count` is bumped only in `InternalKeyComparator`. In places where the user comparator is used directly, the counter is not bumped. This patch fixes the majority of those places. Index iterator and filter code also use the user comparator directly and don't bump the counter. That is not fixed in this patch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098 Differential Revision: D14603753 Pulled By: siying fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae	2019-03-27 10:34:27 -07:00
Yanqin Jin	9358178edc	Support for single-primary, multi-secondary instances (#4899 ) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle a partially-read, trailing record at the end of a log, from where future reads can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886	2019-03-26 16:45:31 -07:00
anand76	959f86e5f8	Use placement new and delete in autovector (#5080 ) Summary: The stack buffer in rocksdb::autovector is currently defined as an array of elements of the template type. This results in unnecessary construction of those objects, which can be a significant overhead in some cases. This PR changes the type of the stack buf to char* and uses placement new to construct new objects when they are inserted into the autovector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5080 Differential Revision: D14533221 Pulled By: anand1976 fbshipit-source-id: 9378985c7d03f4e1a28951bdd2403c72f10f23d7	2019-03-20 10:42:04 -07:00
Zhongyi Xie	a291f3a1e5	Collect compaction stats by priority and dump to info LOG (#5050 ) Summary: In order to better understand compaction done by different priority thread pools, we now collect compaction stats by priority and also print them to info LOG through stats dump.
```
Compaction Stats [default]
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Low 0/0 0.00 KB 0.0 16.8 11.3 5.5 5.6 0.1 0.0 0.0 406.4 136.1 42.24 34.96 45 0.939 13M 8865K
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 11.4 11.4 0.0 0.0 0.0 76.2 153.00 35.74 12185 0.013 0 0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050 Differential Revision: D14408583 Pulled By: miasantreble fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608	2019-03-19 17:28:19 -07:00
Wenjie Yang	36c2a7cfb1	Add an option to filter traces (#5082 ) Summary: Add an option to filter out READ or WRITE operations while tracing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082 Differential Revision: D14515083 Pulled By: mrmiywj fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610	2019-03-19 14:36:51 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Zhongyi Xie	fdc72a5c5d	add OptionType kInt32T and kInt64T Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5061 Differential Revision: D14418581 Pulled By: miasantreble fbshipit-source-id: be7f90e16586666ddd0cce36971e403782ab0892	2019-03-12 13:49:52 -07:00
Siying Dong	aef763b6d6	Make statistics's stats_level change thread-safe (#5030 ) Summary: Right now, users can change statistics.stats_level while the DB is running, but TSAN may report a data race. We make stats_level_ atomic and access it through accessors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030 Differential Revision: D14267519 Pulled By: siying fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73	2019-03-01 10:42:09 -08:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Siying Dong	06f378d75e	When closing BlobDB, should first wait for all background tasks (#5005 ) Summary: When closing a BlobDB, it only waits for background tasks to finish as the last thing, but the background task may access some variables that are already destroyed. The fix is to introduce a shutdown function in the timer queue and call the function as the first thing when destroying BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5005 Differential Revision: D14170342 Pulled By: siying fbshipit-source-id: 081e6a2d99b9765d5956cf6cdfc290c07270c233	2019-02-21 17:26:01 -08:00
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	2019-02-20 15:52:54 -08:00
Maysam Yabandeh	0f4244fe00	WritePrepared: Improve stress tests with slow threads (#4974 ) Summary: The transaction stress tests stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we also need to stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases where the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974 Differential Revision: D14143070 Pulled By: maysamyabandeh fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7	2019-02-19 16:56:49 -08:00


1733 Commits