rocksdb

Author	SHA1	Message	Date
Peter Dillinger	aa2486b23c	Refactor some confusing logic in PlainTableReader Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5780 Test Plan: existing plain table unit test Differential Revision: D17368629 Pulled By: pdillinger fbshipit-source-id: f25409cdc2f39ebe8d5cbb599cf820270e6b5d26	2019-09-13 10:26:36 -07:00
HouBingjian	a378a4c2ac	arm64 crc prefetch optimise (#5773 ) Summary: prefetch data for following block，avoid cache miss when doing crc caculate I do performance test at kunpeng-920 server(arm-v8, 64core@2.6GHz) ./db_bench --benchmarks=crc32c --block_size=500000000 before optimise : 587313.500 micros/op 1 ops/sec; 811.9 MB/s (500000000 per op) after optimise : 289248.500 micros/op 3 ops/sec; 1648.5 MB/s (500000000 per op) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5773 Differential Revision: D17347339 fbshipit-source-id: bfcd74f0f0eb4b322b959be68019ddcaae1e3341	2019-09-12 16:59:44 -07:00
Shylock Hg	9eb3e1f77d	Use delete to disable automatic generated methods. (#5009 ) Summary: Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009 Differential Revision: D17288733 fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3	2019-09-11 18:09:00 -07:00
Peter Dillinger	7af6ced14b	Fix block allocation bug in new DynamicBloom (#5783 ) Summary: Bug found by valgrind. New DynamicBloom wasn't allocating in block sizes. New assertion added that probes starting in final word would be in bounds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5783 Test Plan: ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 valgrind --leak-check=full ./dynamic_bloom_test Differential Revision: D17270623 Pulled By: pdillinger fbshipit-source-id: 1e0407504b875133a771383cd488c70f91be2b87	2019-09-09 15:26:43 -07:00
Peter Dillinger	108c619acb	Add regression test for serialized Bloom filters (#5778 ) Summary: Check that we don't accidentally change the on-disk format of existing Bloom filter implementations, including for various CACHE_LINE_SIZE (by changing temporarily). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5778 Test Plan: thisisthetest Differential Revision: D17269630 Pulled By: pdillinger fbshipit-source-id: c77017662f010a77603b7d475892b1f0d5563d8b	2019-09-09 14:51:30 -07:00
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	2019-09-09 11:24:11 -07:00
HouBingjian	ac97e6930f	bloom test check fail on arm (#5745 ) Summary: FullFilterBitsBuilder::CalculateSpace use CACHE_LINE_SIZE which is 64@X86 but 128@ARM64 when it run bloom_test.FullVaryingLengths it failed on ARM64 server, the assert can be fixed by change 128->CACHE_LINE_SIZE2 as merged ASSERT_LE(FilterSize(), (size_t)((length 10 / 8) + CACHE_LINE_SIZE * 2 + 5)) << length; run bloom_test before fix: /root/rocksdb-master/util/bloom_test.cc:281: Failure Expected: (FilterSize()) <= ((size_t)((length * 10 / 8) + 128 + 5)), actual: 389 vs 383 200 [ FAILED ] FullBloomTest.FullVaryingLengths (32 ms) [----------] 4 tests from FullBloomTest (32 ms total) [----------] Global test environment tear-down [==========] 7 tests from 2 test cases ran. (116 ms total) [ PASSED ] 6 tests. [ FAILED ] 1 test, listed below: [ FAILED ] FullBloomTest.FullVaryingLengths after fix: Filters: 37 good, 0 mediocre [ OK ] FullBloomTest.FullVaryingLengths (90 ms) [----------] 4 tests from FullBloomTest (90 ms total) [----------] Global test environment tear-down [==========] 7 tests from 2 test cases ran. (174 ms total) [ PASSED ] 7 tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5745 Differential Revision: D17076047 fbshipit-source-id: e7beb5d55d4855fceb2b84bc8119a6b0759de635	2019-09-05 17:03:24 -07:00
Peter Dillinger	b55b2f45d0	Faster new DynamicBloom implementation (for memtable) (#5762 ) Summary: Since DynamicBloom is now only used in-memory, we're free to change it without schema compatibility issues. The new implementation is drawn from (with manifest permission) `303542a767/bloom_simulation_tests/foo.cc (L613)` This has several speed advantages over the prior implementation: * Uses fastrange instead of % * Minimum logic to determine first (and all) probed memory addresses * (Major) Two probes per 64-bit memory fetch/write. * Very fast and effective (murmur-like) hash expansion/re-mixing. (At least on recent CPUs, integer multiplication is very cheap.) While a Bloom filter with 512-bit cache locality has about a 1.15x FP rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to 1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples, unlike the old implementation with more erratic FP rates. Especially for the memtable, we expect speed to outweigh somewhat higher FP rates. For example, a negative table query would have to be 1000x slower than a BF query to justify doubling BF query time to shave 10% off FP rate (working assumption around 1% FP rate). While that seems likely for SSTs, my data suggests a speed factor of roughly 50x for the memtable (vs. BF; ~1.5% lower write throughput when enabling memtable Bloom filter, after this change). Thus, it's probably not worth even 5% more time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1% in absolute terms, and it's probably at least 20% slower to recoup that much FP rate from this new implementation. Because of this, we do not see a need for a 'locality' option that affects the MemTable Bloom filter and have decoupled the MemTable Bloom filter from Options::bloom_locality. Note that just 3% more memory to the Bloom filter (10.3 bits per key vs. just 10) is able to make up for the ~12% FP rate drop in the new implementation: [] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ... [] # Close match to this new implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ... [] # Old locality=1 implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ... Also note the dramatic speed improvement vs. alternatives. -- Performance unit test: DynamicBloomTest.concurrent_with_perf is updated to report more precise timing data. (Measure running time of each thread, not just longest running thread, etc.) Results averaged over various sizes enabled with --enable_perf and 20 runs each; old dynamic bloom refers to locality=1, the faster of the old: old dynamic bloom, avg add latency = 65.6468 new dynamic bloom, avg add latency = 44.3809 old dynamic bloom, avg query latency = 50.6485 new dynamic bloom, avg query latency = 43.2186 old avg parallel add latency = 41.678 new avg parallel add latency = 24.5238 old avg parallel hit latency = 14.6322 new avg parallel hit latency = 12.3939 old avg parallel miss latency = 16.7289 new avg parallel miss latency = 12.2134 Tested on a dedicated 64-bit production machine at Facebook. Significant improvement all around. Despite now using std::atomic<uint64_t>, quick before-and-after test on a 32-bit machine (Intel Atom N270, released 2008) shows no regression in performance, in some cases modest improvement. -- Performance integration test (synthetic): with DEBUG_LEVEL=0, used TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000 and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01 300 runs each configuration. Write throughput change by enabling memtable bloom: Old locality=0: -3.06% Old locality=1: -2.37% New: -1.50% conclusion -> seems to substantially close the gap Readmissing throughput change by enabling memtable bloom: Old locality=0: +34.47% Old locality=1: +34.80% New: +33.25% conclusion -> maybe a small new penalty from FP rate Readrandom throughput change by enabling memtable bloom: Old locality=0: +31.54% Old locality=1: +31.13% New: +30.60% conclusion -> maybe also from FP rate (after memtable flush) -- Another conclusion we can draw from this new implementation is that the existing 32-bit hash function is not inherently crippling the Bloom filter speed or accuracy, below about 5 million keys. For speed, the implementation is essentially the same whether starting with 32-bits or 64-bits of hash; it just determines whether the first multiplication after fastrange is a pseudorandom expansion or needed re-mix. Note that this multiplication can occur while memory is fetching. For accuracy, in a standard configuration, you need about 5 million keys before you have about a 1.1x FP penalty due to using a 32-bit hash vs. 64-bit: [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ... [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762 Differential Revision: D17214194 Pulled By: pdillinger fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595	2019-09-05 14:59:25 -07:00
Peter Dillinger	20dec1401f	Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767 ) Summary: DynamicBloom was being used both for memory-only and for on-disk filters, as part of the PlainTable format. To set up enhancements to the memtable Bloom filter, this splits the code into two copies and removes unused features from each copy. Adds test PlainTableDBTest.BloomSchema to ensure no accidental change to that format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5767 Differential Revision: D17206963 Pulled By: pdillinger fbshipit-source-id: 6cce8d55305ed0df051b4c58bdc98c8ad81d0553	2019-09-05 10:05:20 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
DaiZhiwei	26293c89a6	crc32c_arm64 performance optimization (#5675 ) Summary: Crc32c Parallel computation coding optimization: Macro unfolding removes the "for" loop and is good to decrease branch-miss in arm64 micro architecture 1024 Bytes is divided into 8(head) + 1008( 6 * 7 * 3 * 8 ) + 8(tail) three parts Macro unfolding 42 loops to 6 CRC32C7X24BYTESs 1 CRC32C7X24BYTES containing 7 CRC32C24BYTESs 1, crc32c_test [==========] Running 4 tests from 1 test case. [----------] Global test environment set-up. [----------] 4 tests from CRC [ RUN ] CRC.StandardResults [ OK ] CRC.StandardResults (1 ms) [ RUN ] CRC.Values [ OK ] CRC.Values (0 ms) [ RUN ] CRC.Extend [ OK ] CRC.Extend (0 ms) [ RUN ] CRC.Mask [ OK ] CRC.Mask (0 ms) [----------] 4 tests from CRC (1 ms total) [----------] Global test environment tear-down [==========] 4 tests from 1 test case ran. (1 ms total) [ PASSED ] 4 tests. 2, db_bench --benchmarks="crc32c" crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op) 3, repeated crc32c_test case 60000 times perf stat -e branch-miss -- ./crc32c_test before optimization: 739,426,504 branch-miss after optimization: 1,128,572 branch-miss Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675 Differential Revision: D16989210 fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557	2019-08-23 11:04:08 -07:00
Levi Tamasi	df8c307d63	Revert to storing UncompressionDicts in the cache (#5645 ) Summary: PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point of the digest is to create it once and reuse it over and over again. This patch goes back to storing the uncompression dictionary itself in the cache (which should be now safe to do, since it no longer includes a Statistics pointer), while preserving the rest of the refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645 Test Plan: make asan_check Differential Revision: D16551864 Pulled By: ltamasi fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800	2019-08-23 08:27:30 -07:00
Kefu Chai	40712df9ab	ThreadPoolImpl::Impl::BGThreadWrapper() returns void (#5709 ) Summary: there is no need to return void*, as std:🧵:thread(Func&& f, Args&&... args ) only requires `Func` to be callable. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5709 Differential Revision: D16832894 fbshipit-source-id: a1e1b876fa8d55589ef5feb5b27f3a435068b747	2019-08-16 13:55:41 -07:00
Yanqin Jin	ae152ee666	Avoid user key copying for Get/Put/Write with user-timestamp (#5502 ) Summary: In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`. We address these issues in this PR by doing the following. 1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc. 2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`. 3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` If the API extension looks good, I will add more unit tests. Some simple benchmark using db_bench. ``` $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true ``` Master is at `a78503bd6c`. ``` \| \| readrandom \| fillrandom \| \| master \| 15.53 MB/s \| 25.97 MB/s \| \| PR5502 \| 16.70 MB/s \| 25.80 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502 Differential Revision: D16340894 Pulled By: riversand963 fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8	2019-07-25 15:27:39 -07:00
Levi Tamasi	092f417037	Move the uncompression dictionary object out of the block cache (#5584 ) Summary: RocksDB has historically stored uncompression dictionary objects in the block cache as opposed to storing just the block contents. This neccesitated evicting the object upon table close. With the new code, only the raw blocks are stored in the cache, eliminating the need for eviction. In addition, the patch makes the following improvements: 1) Compression dictionary blocks are now prefetched/pinned similarly to index/filter blocks. 2) A copy operation got eliminated when the uncompression dictionary is retrieved. 3) Errors related to retrieving the uncompression dictionary are propagated as opposed to silently ignored. Note: the patch temporarily breaks the compression dictionary evicition stats. They will be fixed in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584 Test Plan: make asan_check Differential Revision: D16344151 Pulled By: ltamasi fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b	2019-07-23 16:01:44 -07:00
Eli Pozniansky	9f5cfb8e71	Fix for ReadaheadSequentialFile crash in ldb_cmd_test (#5586 ) Summary: Fixing a corner case crash when there was no data read from file, but status is still OK Pull Request resolved: https://github.com/facebook/rocksdb/pull/5586 Differential Revision: D16348117 Pulled By: elipoz fbshipit-source-id: f97973308024f020d8be79ca3c56466b84d80656	2019-07-17 17:04:39 -07:00
Yuqi Gu	a3c1832e86	Arm64 CRC32 parallel computation optimization for RocksDB (#5494 ) Summary: Crc32c Parallel computation optimization: Algorithm comes from Intel whitepaper: [crc-iscsi-polynomial-crc32-instruction-paper](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf) Input data is divided into three equal-sized blocks Three parallel blocks (crc0, crc1, crc2) for 1024 Bytes One Block: 42(BLK_LENGTH) * 8(step length: crc32c_u64) bytes 1. crc32c_test: ``` [==========] Running 4 tests from 1 test case. [----------] Global test environment set-up. [----------] 4 tests from CRC [ RUN ] CRC.StandardResults [ OK ] CRC.StandardResults (1 ms) [ RUN ] CRC.Values [ OK ] CRC.Values (0 ms) [ RUN ] CRC.Extend [ OK ] CRC.Extend (0 ms) [ RUN ] CRC.Mask [ OK ] CRC.Mask (0 ms) [----------] 4 tests from CRC (1 ms total) [----------] Global test environment tear-down [==========] 4 tests from 1 test case ran. (1 ms total) [ PASSED ] 4 tests. ``` 2. RocksDB benchmark: db_bench --benchmarks="crc32c" ``` Linear Arm crc32c: crc32c: 1.005 micros/op 995133 ops/sec; 3887.2 MB/s (4096 per op) ``` ``` Parallel optimization with Armv8 crypto extension: crc32c: 0.419 micros/op 2385078 ops/sec; 9316.7 MB/s (4096 per op) ``` It gets ~2.4x speedup compared to linear Arm crc32c instructions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5494 Differential Revision: D16340806 fbshipit-source-id: 95dae9a5b646fd20a8303671d82f17b2e162e945	2019-07-17 11:22:38 -07:00
Eli Pozniansky	0f4d90e6e4	Added support for sequential read-ahead file (#5580 ) Summary: Added support for sequential read-ahead file that can prefetch the read data and later serve it from internal cache buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5580 Differential Revision: D16287082 Pulled By: elipoz fbshipit-source-id: a3e7ad9643d377d39352ff63058ce050ec31dcf3	2019-07-16 18:21:18 -07:00
sdong	699a569c52	Remove RandomAccessFileReader.for_compaction_ (#5572 ) Summary: RandomAccessFileReader.for_compaction_ doesn't seem to be used anymore. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5572 Test Plan: USE_CLANG=1 make all check -j Differential Revision: D16286178 fbshipit-source-id: aa338049761033dfbe5e8b1707bbb0be2df5be7e	2019-07-16 16:32:18 -07:00
Yikun Jiang	f064d74e45	Cleanup the Arm64 CRC32 unused warning (#5565 ) Summary: When 'HAVE_ARM64_CRC' is set, the blew methods: - bool rocksdb::crc32c::isSSE42() - bool rocksdb::crc32c::isPCLMULQDQ() are defined but not used, the unused-function is raised when do rocksdb build. This patch try to cleanup these warnings by add ifndef, if it build under the HAVE_ARM64_CRC, we will not define `isSSE42` and `isPCLMULQDQ`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5565 Differential Revision: D16233654 fbshipit-source-id: c32a9dda7465dbf65f9ccafef159124db92cdffd	2019-07-15 11:20:26 -07:00
ggaurav28	60d8b19836	Implemented a file logger that uses WritableFileWriter (#5491 ) Summary: Current PosixLogger performs IO operations using posix calls. Thus the current implementation will not work for non-posix env. Created a new logger class EnvLogger that uses env specific WritableFileWriter for IO operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491 Test Plan: make check Differential Revision: D15909002 Pulled By: ggaurav28 fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965	2019-07-09 16:27:22 -07:00
sdong	e4dcf5fd22	db_bench to add a new "benchmark" to print out all stats history (#5532 ) Summary: Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532 Test Plan: Run the benchmark manually and observe the output is as expected. Differential Revision: D16097764 fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904	2019-07-03 20:03:28 -07:00
Sagar Vemuri	84c5c9aab1	Fix a bug in compaction reads causing checksum mismatches and asan errors (#5531 ) Summary: Fixed a bug in compaction reads due to which incorrect number of bytes were being read/utilized. The bug was introduced in https://github.com/facebook/rocksdb/issues/5498 , resulting in "Corruption: block checksum mismatch" and "heap-buffer-overflow" asan errors in our tests. https://github.com/facebook/rocksdb/issues/5498 was introduced recently and is not in any released versions. ASAN: ``` > ==2280939==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6250005e83da at pc 0x000000d57f62 bp 0x7f954f483770 sp 0x7f954f482f20 > === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN === > READ of size 4 at 0x6250005e83da thread T4 > SCARINESS: 27 (4-byte-read-heap-buffer-overflow-far-from-bounds) > #0 tests+0xd57f61 __asan_memcpy > https://github.com/facebook/rocksdb/issues/1 rocksdb/src/util/coding.h:124 rocksdb::DecodeFixed32(char const) > https://github.com/facebook/rocksdb/issues/2 rocksdb/src/table/block_fetcher.cc:39 rocksdb::BlockFetcher::CheckBlockChecksum() > https://github.com/facebook/rocksdb/issues/3 rocksdb/src/table/block_fetcher.cc:99 rocksdb::BlockFetcher::TryGetFromPrefetchBuffer() > https://github.com/facebook/rocksdb/issues/4 rocksdb/src/table/block_fetcher.cc:209 rocksdb::BlockFetcher::ReadBlockContents() > https://github.com/facebook/rocksdb/issues/5 rocksdb/src/table/block_based/block_based_table_reader.cc:93 rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader, rocksdb::FilePrefetchBuffer, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<...>, rocksdb::ImmutableCFOptions const&, bool, bool, rocksdb::UncompressionDict const&, rocksdb::PersistentCacheOptions const&, unsigned long, unsigned long, rocksdb::MemoryAllocator, bool) > https://github.com/facebook/rocksdb/issues/6 rocksdb/src/table/block_based/block_based_table_reader.cc:2331 rocksdb::BlockBasedTable::RetrieveBlock(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<...>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, bool) const > https://github.com/facebook/rocksdb/issues/7 rocksdb/src/table/block_based/block_based_table_reader.cc:2090 rocksdb::DataBlockIter rocksdb::BlockBasedTable::NewDataBlockIterator<...>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter, rocksdb::BlockType, bool, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::Status, rocksdb::FilePrefetchBuffe r, bool) const > https://github.com/facebook/rocksdb/issues/8 rocksdb/src/table/block_based/block_based_table_reader.cc:2720 rocksdb::BlockBasedTableIterator<...>::InitDataBlock() > https://github.com/facebook/rocksdb/issues/9 rocksdb/src/table/block_based/block_based_table_reader.cc:2607 rocksdb::BlockBasedTableIterator<...>::SeekToFirst() > https://github.com/facebook/rocksdb/issues/10 rocksdb/src/table/iterator_wrapper.h:83 rocksdb::IteratorWrapperBase<...>::SeekToFirst() > https://github.com/facebook/rocksdb/issues/11 rocksdb/src/table/merging_iterator.cc:100 rocksdb::MergingIterator::SeekToFirst() > https://github.com/facebook/rocksdb/issues/12 rocksdb/compaction/compaction_job.cc:877 rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState) > https://github.com/facebook/rocksdb/issues/13 rocksdb/compaction/compaction_job.cc:590 rocksdb::CompactionJob::Run() > https://github.com/facebook/rocksdb/issues/14 rocksdb/db_impl/db_impl_compaction_flush.cc:2689 rocksdb::DBImpl::BackgroundCompaction(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) > https://github.com/facebook/rocksdb/issues/15 rocksdb/db_impl/db_impl_compaction_flush.cc:2248 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) > https://github.com/facebook/rocksdb/issues/16 rocksdb/db_impl/db_impl_compaction_flush.cc:2024 rocksdb::DBImpl::BGWorkCompaction(void) > https://github.com/facebook/rocksdb/issues/23 rocksdb/src/util/threadpool_imp.cc:266 rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) > https://github.com/facebook/rocksdb/issues/24 rocksdb/src/util/threadpool_imp.cc:307 rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5531 Test Plan: Verified that this fixes the fb-internal Logdevice test which caught the issue. Differential Revision: D16109702 Pulled By: sagar0 fbshipit-source-id: 1fc08549cf7b553e338a133ae11eb9f4d5011914	2019-07-03 19:06:46 -07:00
Eli Pozniansky	f872009237	Fix from some C-style casting (#5524 ) Summary: Fix from some C-style casting in bloom.cc and ./tools/db_bench_tool.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5524 Differential Revision: D16075626 Pulled By: elipoz fbshipit-source-id: 352948885efb64a7ef865942c75c3c727a914207	2019-07-01 13:05:34 -07:00
anand76	7259e28d91	MultiGet parallel IO (#5464 ) Summary: Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464 Test Plan: 1. make check 2. make asan_check 3. make asan_crash Differential Revision: D15911771 Pulled By: anand1976 fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394	2019-06-30 20:56:04 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Sergei Petrunia	e731f44022	C file should not include <cinttypes>, it is a C++ header. (#5499 ) Summary: Include <inttypes.h> instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5499 Differential Revision: D15966937 Pulled By: miasantreble fbshipit-source-id: 2156c4329b91d26d447de94f1231264d52786350	2019-06-24 16:12:39 -07:00
Vijay Nadimpalli	22028aa9ab	Compaction Reads should read no more than compaction_readahead_size bytes, when set! (#5498 ) Summary: As a result of https://github.com/facebook/rocksdb/issues/5431 the compaction_readahead_size given by a user was not used exactly, the reason being the code behind readahead for user-read and compaction-read was unified in the above PR and the behavior for user-read is to read readahead_size+n bytes (see FilePrefetchBuffer::TryReadFromCache method). Before the unification the ReadaheadRandomAccessFileReader used compaction_readahead_size as it is. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5498 Test Plan: Ran strace command : strace -e pread64 -f -T -t ./db_compaction_test --gtest_filter=DBCompactionTest.PartialManualCompaction In the test the compaction_readahead_size was configured to 2MB and verified the pread syscall did indeed request 2MB. Before the change it was requesting more than 2MB. Strace Output: strace: Process 3798982 attached Note: Google Test filter = DBCompactionTest.PartialManualCompaction [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.PartialManualCompaction strace: Process 3798983 attached strace: Process 3798984 attached strace: Process 3798985 attached strace: Process 3798986 attached strace: Process 3798987 attached strace: Process 3798992 attached [pid 3798987] 12:07:05 +++ exited with 0 +++ strace: Process 3798993 attached [pid 3798993] 12:07:05 +++ exited with 0 +++ strace: Process 3798994 attached strace: Process 3799008 attached strace: Process 3799009 attached [pid 3799008] 12:07:05 +++ exited with 0 +++ strace: Process 3799010 attached [pid 3799009] 12:07:05 +++ exited with 0 +++ strace: Process 3799011 attached [pid 3799010] 12:07:05 +++ exited with 0 +++ [pid 3799011] 12:07:05 +++ exited with 0 +++ strace: Process 3799012 attached [pid 3799012] 12:07:05 +++ exited with 0 +++ strace: Process 3799013 attached strace: Process 3799014 attached [pid 3799013] 12:07:05 +++ exited with 0 +++ strace: Process 3799015 attached [pid 3799014] 12:07:05 +++ exited with 0 +++ [pid 3799015] 12:07:05 +++ exited with 0 +++ strace: Process 3799016 attached [pid 3799016] 12:07:05 +++ exited with 0 +++ strace: Process 3799017 attached [pid 3799017] 12:07:05 +++ exited with 0 +++ strace: Process 3799019 attached [pid 3799019] 12:07:05 +++ exited with 0 +++ strace: Process 3799020 attached strace: Process 3799021 attached [pid 3799020] 12:07:05 +++ exited with 0 +++ [pid 3799021] 12:07:05 +++ exited with 0 +++ strace: Process 3799022 attached [pid 3799022] 12:07:05 +++ exited with 0 +++ strace: Process 3799023 attached [pid 3799023] 12:07:05 +++ exited with 0 +++ strace: Process 3799047 attached strace: Process 3799048 attached [pid 3799047] 12:07:06 +++ exited with 0 +++ [pid 3799048] 12:07:06 +++ exited with 0 +++ [pid 3798994] 12:07:06 +++ exited with 0 +++ strace: Process 3799052 attached [pid 3799052] 12:07:06 +++ exited with 0 +++ strace: Process 3799054 attached strace: Process 3799069 attached strace: Process 3799070 attached [pid 3799069] 12:07:06 +++ exited with 0 +++ strace: Process 3799071 attached [pid 3799070] 12:07:06 +++ exited with 0 +++ [pid 3799071] 12:07:06 +++ exited with 0 +++ strace: Process 3799072 attached strace: Process 3799073 attached [pid 3799072] 12:07:06 +++ exited with 0 +++ [pid 3799073] 12:07:06 +++ exited with 0 +++ strace: Process 3799074 attached [pid 3799074] 12:07:06 +++ exited with 0 +++ strace: Process 3799075 attached [pid 3799075] 12:07:06 +++ exited with 0 +++ strace: Process 3799076 attached [pid 3799076] 12:07:06 +++ exited with 0 +++ strace: Process 3799077 attached [pid 3799077] 12:07:06 +++ exited with 0 +++ strace: Process 3799078 attached [pid 3799078] 12:07:06 +++ exited with 0 +++ strace: Process 3799079 attached [pid 3799079] 12:07:06 +++ exited with 0 +++ strace: Process 3799080 attached [pid 3799080] 12:07:06 +++ exited with 0 +++ strace: Process 3799081 attached [pid 3799081] 12:07:06 +++ exited with 0 +++ strace: Process 3799082 attached [pid 3799082] 12:07:06 +++ exited with 0 +++ strace: Process 3799083 attached [pid 3799083] 12:07:06 +++ exited with 0 +++ strace: Process 3799086 attached strace: Process 3799087 attached [pid 3798984] 12:07:06 pread64(9, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000121> [pid 3798984] 12:07:06 pread64(9, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000106> [pid 3798984] 12:07:06 pread64(9, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000081> [pid 3798984] 12:07:06 pread64(9, "\0\v\3foo\2\7\0\0\0\0\0\0\0\270 \0\v\4foo\2\3\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000138> [pid 3798984] 12:07:06 pread64(11, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000097> [pid 3798984] 12:07:06 pread64(11, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086> [pid 3798984] 12:07:06 pread64(11, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000064> [pid 3798984] 12:07:06 pread64(11, "\0\v\3foo\2\21\0\0\0\0\0\0\0\270 \0\v\4foo\2\r\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000064> [pid 3798984] 12:07:06 pread64(12, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(12, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000090> [pid 3798984] 12:07:06 pread64(12, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000059> [pid 3798984] 12:07:06 pread64(12, "\0\v\3foo\2\33\0\0\0\0\0\0\0\270 \0\v\4foo\2\27\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065> [pid 3798984] 12:07:06 pread64(13, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000070> [pid 3798984] 12:07:06 pread64(13, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000059> [pid 3798984] 12:07:06 pread64(13, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000061> [pid 3798984] 12:07:06 pread64(13, "\0\v\3foo\2%\0\0\0\0\0\0\0\270 \0\v\4foo\2!\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000065> [pid 3798984] 12:07:06 pread64(14, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000118> [pid 3798984] 12:07:06 pread64(14, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093> [pid 3798984] 12:07:06 pread64(14, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000050> [pid 3798984] 12:07:06 pread64(14, "\0\v\3foo\2/\0\0\0\0\0\0\0\270 \0\v\4foo\2+\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000082> [pid 3798984] 12:07:06 pread64(15, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(15, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000086> [pid 3798984] 12:07:06 pread64(15, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000091> [pid 3798984] 12:07:06 pread64(15, "\0\v\3foo\0029\0\0\0\0\0\0\0\270 \0\v\4foo\0025\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000174> [pid 3798984] 12:07:06 pread64(16, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000080> [pid 3798984] 12:07:06 pread64(16, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000093> [pid 3798984] 12:07:06 pread64(16, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000194> [pid 3798984] 12:07:06 pread64(16, "\0\v\3foo\2C\0\0\0\0\0\0\0\270 \0\v\4foo\2?\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000086> [pid 3798984] 12:07:06 pread64(17, "\1\203W!\241QE\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 11177) = 53 <0.000079> [pid 3798984] 12:07:06 pread64(17, "\0\22\4rocksdb.properties\353Q\223\5\0\0\0\0\1\0\0"..., 38, 11139) = 38 <0.000047> [pid 3798984] 12:07:06 pread64(17, "\0$\4rocksdb.block.based.table.ind"..., 664, 10475) = 664 <0.000045> [pid 3798984] 12:07:06 pread64(17, "\0\v\3foo\2M\0\0\0\0\0\0\0\270 \0\v\4foo\2I\0\0\0\0\0\0\275"..., 74, 10401) = 74 <0.000107> [pid 3798983] 12:07:06 pread64(17, "\0\v\200\10foo\2P\0\0\0\0\0\0)U?MSg_)j(roFn($e"..., 2097152, 0) = 11230 <0.000091> [pid 3798983] 12:07:06 pread64(17, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(16, "\0\v\200\10foo\2F\0\0\0\0\0\0k[h3%.OPH_^:\\S7T&"..., 2097152, 0) = 11230 <0.000083> [pid 3798983] 12:07:06 pread64(16, "", 2085922, 11230) = 0 <0.000078> [pid 3798983] 12:07:06 pread64(15, "\0\v\200\10foo\2<\0\0\0\0\0\0+qToi_c{S+4:N(:"..., 2097152, 0) = 11230 <0.000095> [pid 3798983] 12:07:06 pread64(15, "", 2085922, 11230) = 0 <0.000067> [pid 3798983] 12:07:06 pread64(14, "\0\v\200\10foo\0022\0\0\0\0\0\0%hw%OMa\"}9I609Q!B"..., 2097152, 0) = 11230 <0.000111> [pid 3798983] 12:07:06 pread64(14, "", 2085922, 11230) = 0 <0.000093> [pid 3798983] 12:07:06 pread64(13, "\0\v\200\10foo\2(\0\0\0\0\0\0p}Y&mu^DcaSGb2&nP"..., 2097152, 0) = 11230 <0.000128> [pid 3798983] 12:07:06 pread64(13, "", 2085922, 11230) = 0 <0.000076> [pid 3798983] 12:07:06 pread64(12, "\0\v\200\10foo\2\36\0\0\0\0\0\0YIyW#]oSs^6VHfB<`"..., 2097152, 0) = 11230 <0.000092> [pid 3798983] 12:07:06 pread64(12, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(11, "\0\v\200\10foo\2\24\0\0\0\0\0\0mfF8Jel/Zf :-#s("..., 2097152, 0) = 11230 <0.000088> [pid 3798983] 12:07:06 pread64(11, "", 2085922, 11230) = 0 <0.000067> [pid 3798983] 12:07:06 pread64(9, "\0\v\200\10foo\2\n\0\0\0\0\0\0\\X'cjiHX)D,RSj1X!"..., 2097152, 0) = 11230 <0.000115> [pid 3798983] 12:07:06 pread64(9, "", 2085922, 11230) = 0 <0.000073> [pid 3798983] 12:07:06 pread64(8, "\1\315\5 \36\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 754) = 53 <0.000098> [pid 3798983] 12:07:06 pread64(8, "\0\22\3rocksdb.properties;\215\5\0\0\0\0\1\0\0\0"..., 37, 717) = 37 <0.000064> [pid 3798983] 12:07:06 pread64(8, "\0$\4rocksdb.block.based.table.ind"..., 658, 59) = 658 <0.000074> [pid 3798983] 12:07:06 pread64(8, "\0\v\2foo\1\0\0\0\0\0\0\0\0\31\0\0\0\0\1\0\0\0\0\212\216\222P", 29, 30) = 29 <0.000064> [pid 3799086] 12:07:06 +++ exited with 0 +++ [pid 3799087] 12:07:06 +++ exited with 0 +++ [pid 3799054] 12:07:06 +++ exited with 0 +++ strace: Process 3799104 attached [pid 3799104] 12:07:06 +++ exited with 0 +++ [ OK ] DBCompactionTest.PartialManualCompaction (757 ms) [----------] 1 test from DBCompactionTest (758 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (759 ms total) [ PASSED ] 1 test. [pid 3798983] 12:07:06 +++ exited with 0 +++ [pid 3798984] 12:07:06 +++ exited with 0 +++ [pid 3798992] 12:07:06 +++ exited with 0 +++ [pid 3798986] 12:07:06 +++ exited with 0 +++ [pid 3798982] 12:07:06 +++ exited with 0 +++ [pid 3798985] 12:07:06 +++ exited with 0 +++ 12:07:06 +++ exited with 0 +++ Differential Revision: D15948422 Pulled By: vjnadimpalli fbshipit-source-id: 9b189d1e8675d290c7784e4b33e5d3b5761d2ac8	2019-06-21 21:31:49 -07:00
Vijay Nadimpalli	24b118ad98	Combine the read-ahead logic for user reads and compaction reads (#5431 ) Summary: Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431 Test Plan: make check Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5 Differential Revision: D15772533 Pulled By: vjnadimpalli fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9	2019-06-19 14:10:46 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
Siying Dong	5851cb7fdb	Move util/trace_replay.* to trace_replay/ (#5376 ) Summary: util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376 Differential Revision: D15550938 Pulled By: siying fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207	2019-06-03 13:25:26 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Yuan Zhou	79edf0a7a8	util: fix log_write_bench (#5335 ) Summary: log_write_bench doesn't compile due to some recent API changes. This patch fixes the compile by adding the missing params for OptimizeForLogWrite() and WritableFileWriter(). Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5335 Differential Revision: D15588875 Pulled By: miasantreble fbshipit-source-id: 726ff4dc227733e915c3b796df25bd3ab0b431ac	2019-05-31 17:17:57 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Siying Dong	cb094e13bb	Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370 ) Summary: Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370 Differential Revision: D15570413 Pulled By: siying fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e	2019-05-31 10:50:19 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
Yanqin Jin	b5e4ee2e76	Fix a clang analyze error (#5365 ) Summary: The analyzer thinks max_allowed_ space can be 0. In that case, free_space will be assigned as free_space. It fails to realize that the function call GetFreeSpace actually sets the free_space variable properly, which is possibly due to lack of inter-function call analysis. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5365 Differential Revision: D15521790 Pulled By: riversand963 fbshipit-source-id: 839d0a285a1c8773a28a385f0c3be4bb7fbe32cb	2019-05-28 12:19:41 -07:00
Sagar Vemuri	e264eebcd7	Add comments in file_reader_writer.h (#5355 ) Summary: Add file and class level comments in file_reader_writer.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5355 Differential Revision: D15499020 Pulled By: sagar0 fbshipit-source-id: 925b2326885cdb4357e6a139ac65ee5e2ce1d613	2019-05-24 20:31:45 -07:00
Yanqin Jin	bd9f1d2d0f	Fix RocksDB auto-recovery from SpaceLimit err (#5334 ) Summary: If RocksDB is configured with a positive max_allowed_space (via sst file manager), then the sst file manager should use this value instead of total free disk space to determine whether to clear the background error of space limit reached. In DBSSTTest.DBWithMaxSpaceAllowed, we configure a low space limit that is very likely lower than the free disk space of the test machine. Therefore, once the test db encounters a Status::SpaceLimit, error handler will call into sst file manager to start error recovery which may clear the bg error since disk free space is larger than reserved_disk_buffer_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5334 Differential Revision: D15501622 Pulled By: riversand963 fbshipit-source-id: 58035efc450b062d6b28c78c322005ec3705fb47	2019-05-24 18:38:12 -07:00
Sagar Vemuri	b09c018b4d	Add comments to trace_replay.h (#5359 ) Summary: Add file, class, and function level comments in trace_replay.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5359 Differential Revision: D15505318 Pulled By: sagar0 fbshipit-source-id: 181e3d4ea805fd9a33f91b89e123bbd0c1ead2ce	2019-05-24 16:59:54 -07:00
Sagar Vemuri	5d359fc337	Document AlignedBuffer (#5345 ) Summary: Add comments to util/aligned_buffer.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5345 Differential Revision: D15496004 Pulled By: sagar0 fbshipit-source-id: 31bc6f35e88dedd74cff55febe02c9e761304f76	2019-05-24 10:05:40 -07:00
Zhongyi Xie	6a54278b4a	add class level comment for RepeatableThread Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5344 Differential Revision: D15485431 Pulled By: miasantreble fbshipit-source-id: 9c0f6cf0d826743e743012549976705ceb8cc0c4	2019-05-23 17:03:23 -07:00
Sagar Vemuri	dda474399a	Remove PATENTS text from a few straggler files (#5326 ) Summary: Remove PATENTS related wording from a few stragglers which still reference the old PATENTS file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5326 Differential Revision: D15423297 Pulled By: sagar0 fbshipit-source-id: 4babcddfc120b7d2fed6eb3898287cf8012bf8ea	2019-05-21 16:22:35 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00

1 2 3 4 5 ...

1758 Commits