rocksdb

Author	SHA1	Message	Date
Hans Holmberg	679a413f11	Close databases on benchmark error exits in db_bench (#7327 ) Summary: Delete database instances to make sure there are no loose threads running before exit(). This fixes segfaults seen when running workloads through CompositeEnvs with custom file systems. For further background on the issues arising when using CompositeEnvs, see the discussion in: https://github.com/facebook/rocksdb/pull/6878 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7327 Reviewed By: cheng-chang Differential Revision: D23433244 Pulled By: ajkr fbshipit-source-id: 4e19cf2067e3fe68c2a3fe1823f24b4091336bbe	2020-09-03 14:36:30 -07:00
Hans Holmberg	2a0d3c7054	Add a file system parameter: --fs_uri to db_stress and db_bench (#6878 ) Summary: This pull request adds the parameter --fs_uri to db_bench and db_stress, creating a composite env combining the default env with a specified registered rocksdb file system. This makes it easier to develop and test new RocksDB FileSystems. The pull request also registers the posix file system for testing purposes. Examples: ``` $./db_bench --fs_uri=posix:// --benchmarks=fillseq $./db_stress --fs_uri=zenfs://nullb1 ``` zenfs is a RocksDB FileSystem I'm developing to add support for zoned block devices, and in that case the zoned block device is specified in the uri (a zoned null block device in the above example). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6878 Reviewed By: siying Differential Revision: D23023063 Pulled By: ajkr fbshipit-source-id: 8b3fe7193ce45e683043b021779b7a4d547af247	2020-08-17 11:55:24 -07:00
Aaron Kabcenell	56ed601df3	Compaction Read/Write Stats by Compaction Type (#7165 ) Summary: Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165 Test Plan: TTL and periodic can be checked by runnning db_bench with the options activated: /db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1 ./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1 Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected. Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking Reviewed By: ajkr Differential Revision: D22693050 Pulled By: akabcenell fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d	2020-07-29 13:39:29 -07:00
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427 ) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	2020-06-22 13:32:07 -07:00
Peter Dillinger	c7432cc3c0	Fix more defects reported by Coverity Scan (#6935 ) Summary: Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935 Test Plan: make check Reviewed By: siying Differential Revision: D21885484 Pulled By: pdillinger fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08	2020-06-04 15:35:08 -07:00
Peter Dillinger	14eca6bf04	For ApproximateSizes, pro-rate table metadata size over data blocks (#6784 ) Summary: The implementation of GetApproximateSizes was inconsistent in its treatment of the size of non-data blocks of SST files, sometimes including and sometimes now. This was at its worst with large portion of table file used by filters and querying a small range that crossed a table boundary: the size estimate would include large filter size. It's conceivable that someone might want only to know the size in terms of data blocks, but I believe that's unlikely enough to ignore for now. Similarly, there's no evidence the internal function AppoximateOffsetOf is used for anything other than a one-sided ApproximateSize, so I intend to refactor to remove redundancy in a follow-up commit. So to fix this, GetApproximateSizes (and implementation details ApproximateSize and ApproximateOffsetOf) now consistently include in their returned sizes a portion of table file metadata (incl filters and indexes) based on the size portion of the data blocks in range. In other words, if a key range covers data blocks that are X% by size of all the table's data blocks, returned approximate size is X% of the total file size. It would technically be more accurate to attribute metadata based on number of keys, but that's not computationally efficient with data available and rarely a meaningful difference. Also includes miscellaneous comment improvements / clarifications. Also included is a new approximatesizerandom benchmark for db_bench. No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784 Test Plan: Test added to DBTest.ApproximateSizesFilesWithErrorMargin. Old code running new test... [ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin db/db_test.cc:1562: Failure Expected: (size) <= (11 * 100), actual: 9478 vs 1100 Other tests updated to reflect consistent accounting of metadata. Reviewed By: siying Differential Revision: D21334706 Pulled By: pdillinger fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185	2020-06-02 12:30:23 -07:00
Mian Qin	d9e170d82b	Fix issues for reproducing synthetic ZippyDB workloads in the FAST20' paper (#6795 ) Summary: Fix issues for reproducing synthetic ZippyDB workloads in the FAST20' paper using db_bench. Details changes as follows. 1, add a separate random mode in MixGraph to produce all_random workload. 2, fix power inverse function for generating prefix_dist workload. 3, make sure key_offset in prefix mode is always unsigned. note: Need to carefully choose key_dist_a/b to avoid aliasing. Power inverse function range should be close to overall key space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6795 Reviewed By: akankshamahajan15 Differential Revision: D21371095 Pulled By: zhichao-cao fbshipit-source-id: 80744381e242392c8c7cf8ac3d68fe67fe876048	2020-05-04 10:55:14 -07:00
Ziyue Yang	e619a20e93	Add an option for parallel compression in for db_stress (#6722 ) Summary: This commit adds an `compression_parallel_threads` option in db_stress. It also fixes the naming of parallel compression option in db_bench to keep it aligned with others. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6722 Reviewed By: pdillinger Differential Revision: D21091385 fbshipit-source-id: c9ba8c4e5cc327ff9e6094a6dc6a15fcff70f100	2020-04-30 10:49:07 -07:00
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	2020-04-29 13:09:23 -07:00
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	2020-04-20 13:24:25 -07:00
sdong	1be3be5522	Auto-Format two recent diffs and add HISTORY.md (#6685 ) Summary: Two recent diffs can be autoformatted. Also add HISTORY.md entry for https://github.com/facebook/rocksdb/pull/6214 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6685 Test Plan: Run all existing tests Reviewed By: cheng-chang Differential Revision: D20965780 fbshipit-source-id: 195b08d7849513d42fe14073112cd19fdda6af95	2020-04-10 11:32:44 -07:00
Luca Giacchino	66a95f0fac	Provide an allocator for new memory type to be used with RocksDB block cache (#6214 ) Summary: New memory technologies are being developed by various hardware vendors (Intel DCPMM is one such technology currently available). These new memory types require different libraries for allocation and management (such as PMDK and memkind). The high capacities available make it possible to provision large caches (up to several TBs in size), beyond what is achievable with DRAM. The new allocator provided in this PR uses the memkind library to allocate memory on different media. Performance We tested the new allocator using db_bench. - For each test, we vary the size of the block cache (relative to the size of the uncompressed data in the database). - The database is filled sequentially. Throughput is then measured with a readrandom benchmark. - We use a uniform distribution as a worst-case scenario. The plot shows throughput (ops/s) relative to a configuration with no block cache and default allocator. For all tests, p99 latency is below 500 us. ![image](https://user-images.githubusercontent.com/26400080/71108594-42479100-2178-11ea-8231-8a775bbc92db.png) Changes - Add MemkindKmemAllocator - Add --use_cache_memkind_kmem_allocator db_bench option (to create an LRU block cache with the new allocator) - Add detection of memkind library with KMEM DAX support - Add test for MemkindKmemAllocator Minimum Requirements - kernel 5.3.12 - ndctl v67 - https://github.com/pmem/ndctl - memkind v1.10.0 - https://github.com/memkind/memkind Memory Configuration The allocator uses the MEMKIND_DAX_KMEM memory kind. Follow the instructions on[ memkind’s GitHub page](https://github.com/memkind/memkind) to set up NVDIMM memory accordingly. Note on memory allocation with NVDIMM memory exposed as system memory. - The MemkindKmemAllocator will only allocate from NVDIMM memory (using memkind_malloc with MEMKIND_DAX_KMEM kind). - The default allocator is not restricted to RAM by default. Based on NUMA node latency, the kernel should allocate from local RAM preferentially, but it’s a kernel decision. numactl --preferred/--membind can be used to allocate preferentially/exclusively from the local RAM node. Usage When creating an LRU cache, pass a MemkindKmemAllocator object as argument. For example (replace capacity with the desired value in bytes): ``` #include "rocksdb/cache.h" #include "memory/memkind_kmem_allocator.h" NewLRUCache( capacity /size_t/, 6 /cache_numshardbits/, false /strict_capacity_limit/, false /cache_high_pri_pool_ratio/, std::make_shared<MemkindKmemAllocator>()); ``` Refer to [RocksDB’s block cache documentation](https://github.com/facebook/rocksdb/wiki/Block-Cache) to assign the LRU cache as block cache for a database. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6214 Reviewed By: cheng-chang Differential Revision: D19292435 fbshipit-source-id: 7202f47b769e7722b539c86c2ffd669f64d7b4e1	2020-04-09 20:47:23 -07:00
CaixinGong	a91613dd06	Fix readrandom return NotFound after fillrandom in db_bench (#6665 ) Summary: This commit is fixing a bug that readrandom test returns many NotFound in db_bench from Version 6.2. Pull Request resolved: https://github.com/facebook/rocksdb/issues/6664 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6665 Reviewed By: cheng-chang Differential Revision: D20911298 Pulled By: ajkr fbshipit-source-id: c2658d4dbb35798ccbf67dff6e64923fb731ef81	2020-04-08 14:27:12 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
sdong	488b1e6739	Fix an error in db_bench with gcc 4.8 (#6537 ) Summary: I start to see following failures: tools/db_bench_tool.cc: In constructor ‘rocksdb::NormalDistribution::NormalDistribution(unsigned int, unsigned int)’: tools/db_bench_tool.cc:1528:58: error: declaration of ‘max’ shadows a member of 'this' [-Werror=shadow] NormalDistribution(unsigned int min, unsigned int max) : ^ tools/db_bench_tool.cc:1528:58: error: declaration of ‘min’ shadows a member of 'this' [-Werror=shadow] tools/db_bench_tool.cc: In constructor ‘rocksdb::UniformDistribution::UniformDistribution(unsigned int, unsigned int)’: tools/db_bench_tool.cc:1546:59: error: declaration of ‘max’ shadows a member of 'this' [-Werror=shadow] UniformDistribution(unsigned int min, unsigned int max) : ^ tools/db_bench_tool.cc:1546:59: error: declaration of ‘min’ shadows a member of 'this' [-Werror=shadow] when I build from GCC 4.8. Rename those variables to fix the problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6537 Test Plan: make all with the compiler that used to show the failure. Differential Revision: D20448741 fbshipit-source-id: 18bcf012dbe020f22f79038a9b08f447befa2574	2020-03-16 13:50:40 -07:00
Levi Tamasi	8637bc1eea	Fix the description of unordered_write in db_bench (#6476 ) Summary: As reported in https://github.com/facebook/rocksdb/issues/6467, the description of the `unordered_write` switch of `db_bench` was incorrect. (Note: the new description is based on https://rocksdb.org/blog/2019/08/15/unordered-write.html). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6476 Test Plan: `db_bench --help` Differential Revision: D20200653 Pulled By: ltamasi fbshipit-source-id: 4c3683fcfa6a069164167af5aaff9974a810c16a	2020-03-02 15:34:19 -08:00
sdong	9b3c9ef0e8	Add --index_with_first_key and --index_shortening_mode to DB bench (#5859 ) Summary: Some combinatino of --index_with_first_key and --index_shortening_mode can signifcantly improve performance for large values. Expose them in db_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5859 Test Plan: Run them with the new options and observe the behavior. Differential Revision: D20104434 fbshipit-source-id: 21d48a732a9caf20b82312c7d7557d747ea3c304	2020-03-02 11:55:28 -08:00
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	2020-02-28 14:14:03 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
sdong	df3f33dd05	Fix db_bench LITE build recently broken (#6411 ) Summary: A recent change https://github.com/facebook/rocksdb/pull/6386 broke LITE build in a trivial way. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6411 Test Plan: Run "LITE=1 make all" Differential Revision: D19871765 fbshipit-source-id: 74f0ad3f8a9d666fbde0da7fd29ba1547a811f77	2020-02-13 10:52:50 -08:00
Burton Li	e64508917b	db_bench supports for generating random variable sized value. (#6386 ) Summary: 1. `db_bench` now supports `value_size_distribution_type`, `value_size_min`, `value_size_max` options for generating random variable sized value. 2. Added `blob_db_compression_type` option for BlobDB to enable blob compression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6386 Differential Revision: D19859406 Pulled By: zhichao-cao fbshipit-source-id: ace52674090023fde15d832392110bf288a8e215	2020-02-12 14:47:03 -08:00
sdong	876c2dbff4	Allow readahead when reading option files. (#6372 ) Summary: Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372 Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace. Differential Revision: D19727739 fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583	2020-02-07 15:18:26 -08:00
sdong	24c9dce825	Remove include math.h (#6373 ) Summary: We see some odd errors complaining math. However, it doesn't seem that it is needed to be included. Remove the include of math.h. Just removing it from db_bench doesn't seem to break anything. Replacing sqrt from std::sqrt seems to work for histogram.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/6373 Test Plan: Watch Travis and appveyor to run. Differential Revision: D19730068 fbshipit-source-id: d3ad41defcdd9f51c2da1a3673fb258f5dfacf47	2020-02-05 21:00:49 -08:00
Levi Tamasi	130e710056	Add BlobDB GC cutoff parameter to db_bench (#6211 ) Summary: The patch makes it possible to set the BlobDB configuration option `garbage_collection_cutoff` on the command line. In addition, it changes the `db_bench` code so that the default values of BlobDB related parameters are taken from the defaults of the actual BlobDB configuration options (note: this changes the the default of `blob_db_bytes_per_sync`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6211 Test Plan: Ran `db_bench` with various values of the new parameter. Differential Revision: D19166895 Pulled By: ltamasi fbshipit-source-id: 305ccdf0123b9db032b744715810babdc3e3b7d5	2019-12-18 17:46:08 -08:00
Zhichao Cao	8ea087ad16	Workload generator (Mixgraph) based on prefix hotness (#5953 ) Summary: In the previous PR https://github.com/facebook/rocksdb/issues/4788, user can use db_bench mix_graph option to generate the workload that is from the social graph. The key is generated based on the key access hotness. In this PR, user can further model the key-range hotness and fit those to two-term-exponential distribution. First, user cuts the whole key space into small key ranges (e.g., key-ranges are the same size and the key-range number is the number of SST files). Then, user calculates the average access count per key of each key-range as the key-range hotness. Next, user fits the key-range hotness to two-term-exponential distribution (f(x) = f(x) = aexp(bx) + cexp(dx)) and generate the value of a, b, c, and d. They are the parameters in db_bench: prefix_dist_a, prefix_dist_b, prefix_dist_c, and prefix_dist_d. Finally, user can run db_bench by specify the parameters. For example: `./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953 Test Plan: run db_bench with different parameters and checked the results. Differential Revision: D18053527 Pulled By: zhichao-cao fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7	2019-11-06 13:02:20 -08:00
Yanqin Jin	c0abc6bbc1	Use FLAGS_env for certain operations in db_bench (#5943 ) Summary: Since we already parse env_uri from command line and creates custom Env accordingly, we should invoke the methods of such Envs instead of using Env::Default(). Test Plan (on devserver): ``` $make db_bench db_stress $./db_bench -benchmarks=fillseq ./db_stress ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5943 Differential Revision: D18018550 Pulled By: riversand963 fbshipit-source-id: 03b61329aaae0dfd914a0b902cc677f570f102e3	2019-10-22 11:43:21 -07:00
Zhichao Cao	526e3b9763	Enable trace_replay with multi-threads (#5934 ) Summary: In the current trace replay, all the queries are serialized and called by single threads. It may not simulate the original application query situations closely. The multi-threads replay is implemented in this PR. Users can set the number of threads to replay the trace. The queries generated according to the trace records are scheduled in the thread pool job queue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934 Test Plan: test with make check and real trace replay. Differential Revision: D17998098 Pulled By: zhichao-cao fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8	2019-10-18 14:13:50 -07:00
Levi Tamasi	78b28d80b0	Support non-TTL Puts for BlobDB in db_bench (#5921 ) Summary: Currently, db_bench only supports PutWithTTL operations for BlobDB but not regular Puts. The patch adds support for regular (non-TTL) Puts and also changes the default for blob_db_max_ttl_range to zero, which corresponds to no TTL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5921 Test Plan: make check ./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1 -duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000 -target_file_size_base=1000000 (issues Put operations with no TTL) ./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1 -duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000 -target_file_size_base=1000000 -blob_db_max_ttl_range=86400 (issues PutWithTTL operations with random TTLs in the [0, blob_db_max_ttl_range) interval, as before) Differential Revision: D17919798 Pulled By: ltamasi fbshipit-source-id: b946c3522b836b92b4c157ffbad24f92ba2b0a16	2019-10-14 17:49:20 -07:00
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	2019-09-20 12:04:26 -07:00
Lingjing You	1a928c22a0	Add insert hints for each writebatch (#5728 ) Summary: Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it. Bench result (qps): `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4` master: \| batch size \ thread num \| 1 \| 2 \| 4 \| 8 \| \| ----------------------- \| ------- \| ------- \| ------- \| ------- \| \| 1 \| 387883 \| 220790 \| 308294 \| 490998 \| \| 10 \| 1397208 \| 978911 \| 1275684 \| 1733395 \| \| 100 \| 2045414 \| 1589927 \| 1798782 \| 2681039 \| \| 1000 \| 2228038 \| 1698252 \| 1839877 \| 2863490 \| fillseq with writebatch hint: \| batch size \ thread num \| 1 \| 2 \| 4 \| 8 \| \| ----------------------- \| ------- \| ------- \| ------- \| ------- \| \| 1 \| 286005 \| 223570 \| 300024 \| 466981 \| \| 10 \| 970374 \| 813308 \| 1399299 \| 1753588 \| \| 100 \| 1962768 \| 1983023 \| 2676577 \| 3086426 \| \| 1000 \| 2195853 \| 2676782 \| 3231048 \| 3638143 \| Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728 Differential Revision: D17297240 fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c	2019-09-12 17:15:18 -07:00
anand76	eb9026f09b	Add a db_bench benchmark to warm up the row cache Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5707 Differential Revision: D17242698 Pulled By: anand1976 fbshipit-source-id: 5d1bfda3c9e8f56176ae391cae6c91e6262016b8	2019-09-10 11:06:36 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	2019-08-06 14:26:44 -07:00
Mark Rambacher	cfcf045acc	The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293 ) Summary: The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method). This change is necessary for a few reasons: - By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable. - By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not. When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program. Test plan (on riversand963's devserver) ``` $COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check ``` All tests pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293 Differential Revision: D16363396 Pulled By: riversand963 fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180	2019-07-23 17:13:05 -07:00
sdong	e4dcf5fd22	db_bench to add a new "benchmark" to print out all stats history (#5532 ) Summary: Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532 Test Plan: Run the benchmark manually and observe the output is as expected. Differential Revision: D16097764 fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904	2019-07-03 20:03:28 -07:00
haoyuhuang	66464d1fde	Remove multiple declarations o kMicrosInSecond. Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526 Test Plan: OPT=-g V=1 make J=1 unity_test -j32 make clean && make -j32 Differential Revision: D16079315 Pulled By: HaoyuHuang fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6	2019-07-01 15:15:12 -07:00
Eli Pozniansky	3e6c185381	Formatting fixes in db_bench_tool (#5525 ) Summary: Formatting fixes in db_bench_tool that were accidentally omitted Pull Request resolved: https://github.com/facebook/rocksdb/pull/5525 Test Plan: Unit tests Differential Revision: D16078516 Pulled By: elipoz fbshipit-source-id: bf8df0e3f08092a91794ebf285396d9b8a335bb9	2019-07-01 14:57:28 -07:00
Eli Pozniansky	f872009237	Fix from some C-style casting (#5524 ) Summary: Fix from some C-style casting in bloom.cc and ./tools/db_bench_tool.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5524 Differential Revision: D16075626 Pulled By: elipoz fbshipit-source-id: 352948885efb64a7ef865942c75c3c727a914207	2019-07-01 13:05:34 -07:00
Zhongyi Xie	671d15cbdd	Persistent Stats: persist stats history to disk (#5046 ) Summary: This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046 Differential Revision: D15863138 Pulled By: miasantreble fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b	2019-06-17 15:21:50 -07:00
haoyuhuang	d43b4cd570	Integrate block cache tracing into db_bench (#5459 ) Summary: This PR integrates the block cache tracing into db_bench. It adds three command line arguments. -block_cache_trace_file (Block cache trace file path.) type: string default: "" -block_cache_trace_max_trace_file_size_in_bytes (The maximum block cache trace file size in bytes. Block cache accesses will not be logged if the trace file size exceeds this threshold. Default is 64 GB.) type: int64 default: 68719476736 -block_cache_trace_sampling_frequency (Block cache trace sampling frequency, termed s. It uses spatial downsampling and samples accesses to one out of s blocks.) type: int32 default: 1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5459 Differential Revision: D15832031 Pulled By: HaoyuHuang fbshipit-source-id: 0ecf2f2686557251fe741a2769b21170777efa3d	2019-06-17 11:08:21 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Yanqin Jin	83f7a8eed0	Fix compilation error in LITE mode (#5391 ) Summary: Add macro ROCKSDB_LITE to fix compilation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5391 Differential Revision: D15574522 Pulled By: riversand963 fbshipit-source-id: 95aea83c5d9b2bf98a3ba0ef9167b63c9be2988b	2019-05-31 08:32:22 -07:00
Yanqin Jin	b9f5900658	Fix WAL replay by skipping old write batches (#5170 ) Summary: 1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables. 2. Add support for benchmarking secondary instance to db_bench_tool. With changes made in this PR, we can start benchmarking secondary instance using two processes. It is also possible to vary the frequency at which the secondary instance tries to catch up with the primary. The info log of the secondary can be found in a directory whose path can be specified with '-secondary_path'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170 Differential Revision: D15564608 Pulled By: riversand963 fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8	2019-05-30 19:33:33 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Maysam Yabandeh	eab4f49a2c	WritePrepared: skip_concurrency_control option (#5330 ) Summary: This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330 Differential Revision: D15525932 Pulled By: maysamyabandeh fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b	2019-05-28 16:29:45 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yi Wu	92c60547fe	db_bench: fix hang on IO error (#5300 ) Summary: db_bench will wait indefinitely if there's background error. Fix by pass `abs_time_us` to cond var. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300 Differential Revision: D15319945 Pulled By: miasantreble fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44	2019-05-13 11:30:35 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Adam Retter	990b2f4cb3	Fix compilation on db_bench_tool.cc on Windows (#5227 ) Summary: I needed this change to be able to build the v6.0.1 release on Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5227 Differential Revision: D15033815 Pulled By: sagar0 fbshipit-source-id: 579f3b8e694c34c0d43527eb2fa37175e37f5911	2019-04-23 11:16:51 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Yi Wu	b70967aac7	db_bench: support seek to non-exist prefix (#5163 ) Summary: Add `--seek_missing_prefix` flag to db_bench to allow benchmarking seeking to non-existing prefix. Usage example: ``` ./db_bench --db=/dev/shm/db_bench --use_existing_db=false --benchmarks=fillrandom --num=100000000 --prefix_size=9 --keys_per_prefix=10 ./db_bench --db=/dev/shm/db_bench --use_existing_db=true --benchmarks=seekrandom --disable_auto_compactions=true --num=100000000 --prefix_size=9 --keys_per_prefix=10 --reads=1000 --prefix_same_as_start=true --seek_missing_prefix=true ``` Also adding `--total_order_seek` and `--prefix_same_as_start` flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5163 Differential Revision: D14935724 Pulled By: riversand963 fbshipit-source-id: 7c41023f007febe373eb1589861f215432a9e18a	2019-04-15 10:54:58 -07:00
Yanqin Jin	3189398c00	Fix bugs detected by clang analyzer (#5185 ) Summary: as titled. False positive included, fixed anyway to make the check pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185 Differential Revision: D14909384 Pulled By: riversand963 fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3	2019-04-12 10:45:56 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). Batch Sizes 1 \| 2 \| 4 \| 8 \| 16 \| 32 Random pattern (Stride length 0) 4.158 \| 4.109 \| 4.026 \| 4.05 \| 4.1 \| 4.074 - Get 4.438 \| 4.302 \| 4.165 \| 4.122 \| 4.096 \| 4.075 - MultiGet (no batching) 4.461 \| 4.256 \| 4.277 \| 4.11 \| 4.182 \| 4.14 - MultiGet (w/ batching) Good locality (Stride length 16) 4.048 \| 3.659 \| 3.248 \| 2.99 \| 2.84 \| 2.753 4.429 \| 3.728 \| 3.406 \| 3.053 \| 2.911 \| 2.781 4.452 \| 3.45 \| 2.833 \| 2.451 \| 2.233 \| 2.135 Good locality (Stride length 256) 4.066 \| 3.786 \| 3.581 \| 3.447 \| 3.415 \| 3.232 4.406 \| 4.005 \| 3.644 \| 3.49 \| 3.381 \| 3.268 4.393 \| 3.649 \| 3.186 \| 2.882 \| 2.676 \| 2.62 Medium locality (Stride length 4096) 4.012 \| 3.922 \| 3.768 \| 3.61 \| 3.582 \| 3.555 4.364 \| 4.057 \| 3.791 \| 3.65 \| 3.57 \| 3.465 4.479 \| 3.758 \| 3.316 \| 3.077 \| 2.959 \| 2.891 dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Siying Dong	2b4d5ceb47	Remove some "using std::..." from header files. (#5113 ) Summary: The code convention we are following, Google C++ Style, discourage alias in header files, especially public headers: https://google.github.io/styleguide/cppguide.html#Aliases Remove some of them. Might removed some from .cc files as well to be consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113 Differential Revision: D14633030 Pulled By: siying fbshipit-source-id: b990edc919d5de60295992284f980195e501d424	2019-03-27 10:28:21 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Zhichao Cao	05ebfebc17	Fixed the potential stack overflow of MixGraph in db_bench (#5051 ) Summary: In the MixGraph benchmark of db_bench, The max buffer size used for value of KV-pair might be extremely large (64MB), which might cause function stack overflow in some platforms, reduced to 1MB. Added the finished ops printing in MixGraph benchmark. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5051 Differential Revision: D14379571 Pulled By: zhichao-cao fbshipit-source-id: 24084fbe38f60f2902d9a40f6bc9a25e4e2c9bb9	2019-03-08 14:10:17 -08:00
Andrew Kryczka	18d2e4beb7	Run db_bench on database generated externally (#5017 ) Summary: Added an option, `-use_existing_keys`, which can be set to run benchmarks against an arbitrary existing database. Now users can benchmark against their actual database rather than synthetic data. Before the run begins, it loads all the keys into memory, then uses that set of keys rather than synthesizing new ones in `GenerateKeyFromInt`. This is mainly intended for small-scale DBs where the memory consumption is not a concern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5017 Differential Revision: D14270303 Pulled By: riversand963 fbshipit-source-id: 6328df9dffb5e19170270dd00a69f4bbe424e5ed	2019-03-01 11:19:03 -08:00
Siying Dong	aef763b6d6	Make statistics's stats_level change thread-safe (#5030 ) Summary: Right now, users can change statistics.stats_level while DB is running, but TSAN may report data race. We make stats_level_ to be atomic, and access them using accessors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030 Differential Revision: D14267519 Pulled By: siying fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73	2019-03-01 10:42:09 -08:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	2019-02-20 15:52:54 -08:00
Zhongyi Xie	ed995c6a69	add whole key bloom filter support in memtables (#4985 ) Summary: MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985 Differential Revision: D14118257 Pulled By: miasantreble fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea	2019-02-19 12:15:39 -08:00
Siying Dong	4db46aa2e6	Fix LITE Build (#4989 ) Summary: LITE mode has EventListener to be an empty class. However in db_bench, it is used. When "override" is added to the functions, the build breaks. Fix it by keeping the listener empty in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4989 Differential Revision: D14108132 Pulled By: siying fbshipit-source-id: 80121aab35b1120e502b37b782301dd700692697	2019-02-15 16:13:11 -08:00
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	2019-02-15 09:51:41 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Siying Dong	d9c9f3c809	db_bench: fix "micros/op" reporting (#4949 ) Summary: `4985a9f73b (diff-e5276985b26a0551957144f4420a594bR511)` changes the meaning of latency reporting from running time per query, to elapse_time / #ops, without providing a reason why. Considering that this is a counter-intuitive reporting, Reverting the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4949 Differential Revision: D13964684 Pulled By: siying fbshipit-source-id: d6304d3d4b5a802daa292302623c7dbca9a680bc	2019-02-05 17:20:02 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
zhichao-cao	e2547103fd	Fix the build error caused by the dynamic array (#4918 ) Summary: In the MixGraph benchmark of db_bench #4788 , the char array is initialized with an argument from user's input, which can cause build error on some platforms. Also, the msg char array size can be potentially smaller than the printed data, which should be extended from 100 to 256. Tested with make check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4918 Differential Revision: D13844298 Pulled By: sagar0 fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab	2019-01-28 12:39:57 -08:00
Andrew Kryczka	e242fa4664	Add latest toolchain (gcc-8, etc.) build support for fbcode users (#4923 ) Summary: - When building with internal dependencies, specify this toolchain by setting `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1` - It is not enabled by default. However, it is enabled for TSAN builds in CI since there is a known problem with TSAN in gcc-5: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71090 - I did not add support for Lua since (1) we agreed to deprecate it, and (2) we only have an internal build for v5.3 with this toolchain while that has breaking changes compared to our current version (v5.2). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4923 Differential Revision: D13827226 Pulled By: ajkr fbshipit-source-id: 9aa3388ed3679777cfb15ef8cbcb83c07f62f947	2019-01-28 11:26:32 -08:00
Sagar Vemuri	0cead31d10	Fix Clang static analyzer warning in db_bench (#4910 ) Summary: Fixed clang static analyzer warning about division by 0. ``` ar: creating librocksdb_debug.a tools/db_bench_tool.cc:4650:43: warning: Division by zero int pos = static_cast<int>(rand_num % range_); ~~~~~~~~~^~~~~~~~ 1 warning generated. make: *** [analyze] Error 1 ``` This is from the new code I recently merged in `ce8e88d2d7`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4910 Differential Revision: D13788037 Pulled By: sagar0 fbshipit-source-id: f48851dca85047c19fbb1a361e25ce643aa4c7ea	2019-01-23 13:33:02 -08:00
Zhongyi Xie	cbe0239270	add cast to avoid loss of precision error (#4906 ) Summary: this PR address the following error: > tools/db_bench_tool.cc:4776:68: error: implicit conversion loses integer precision: 'int64_t' (aka 'long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] s = db_with_cfh->db->Put(write_options_, key, gen.Generate(value_size)); Pull Request resolved: https://github.com/facebook/rocksdb/pull/4906 Differential Revision: D13780185 Pulled By: miasantreble fbshipit-source-id: 1c83a77d341099518c72f0f4a63e97ab9c4784b3	2019-01-22 22:44:17 -08:00
Zhichao Cao	ce8e88d2d7	Generate mixed workload with Get, Put, Seek in db_bench (#4788 ) Summary: Based on the specific workload models (key access distribution, value size distribution, and iterator scan length distribution, the QPS variation), the MixGraph benchmark generate the synthetic workload according to these distributions which can reflect the real-world workload characteristics. After user enable the tracing function, they will get the trace file. By analyzing the trace file with the trace_analyzer tool, user can generate a set of statistic data files including. The _accessed_key_stats.txt, -accessed_value_size_distribution.txt, -iterator_length_distribution.txt, and -qps_stats.txt are mainly used to fit the Matlab model fitting. After that, user can get the parameters of the workload distributions (the modeling details are described: [here](https://github.com/facebook/rocksdb/wiki/RocksDB-Trace%2C-Replay%2C-and-Analyzer)) The key access distribution follows the The two-term power model. The probability density function is: `f(x) = ax^{b}+c`. The corresponding parameters are key_dist_a, key_dist_b, and key_dist_c in db_bench For the value size distribution and iterator scan length distribution, they both follow the Generalized Pareto Distribution. The probability density function is `f(x) = (1/sigma)(1+k(x-theta)/sigma))^{-1-1/k)`. The parameters are: value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, users can find the [wiki](https://en.wikipedia.org/wiki/Generalized_Pareto_distribution) and [Matalb page](https://www.mathworks.com/help/stats/generalized-pareto-distribution.html) As for the QPS, it follows the diurnal pattern. So Sine is a good model to fit it. `F(x) = sine_asin(sine_bx + sine_c) + sine_d`. The trace_will tell you the average QPS in the print out resutls, which is sine_d. After user fit the "-qps_stats.txt" to the Matlab model, user can get the sine_a, sine_b, and sine_c. By using the 4 parameters, user can control the QPS variation including the period, average, changes. To use the bench mark, user can indicate the following parameters as examples: ``` -benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4788 Differential Revision: D13573940 Pulled By: sagar0 fbshipit-source-id: e184c27e07b4f1bc0b436c2be36c5090c1fb0222	2019-01-22 10:44:26 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	2018-12-13 14:21:24 -08:00
Sagar Vemuri	70645355ad	Move FIFOCompactionPicker to a separate file (#4724 ) Summary: Summary: Simplified the code layout by moving FIFOCompactionPicker to a separate file. Why?: While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724 Differential Revision: D13227914 Pulled By: sagar0 fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3	2018-11-29 16:04:52 -08:00
Sagar Vemuri	c94f073e5e	Fix Mac build break in casting (#4722 ) Summary: Mac build is failing with the below error: ``` $ make db_bench -j8 ... ... tools/db_bench_tool.cc:4583:25: error: no matching function for call to 'max' (uint64_t)std::max(0l, seek_pos - FLAGS_max_scan_distance), ^~~~~~~~ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2717:1: note: candidate template ignored: deduced conflicting types for parameter '_Tp' ('long' vs. 'long long') max(const _Tp& __a, const _Tp& __b) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2727:1: note: candidate template ignored: could not match 'initializer_list<type-parameter-0-0>' against 'long' max(initializer_list<_Tp> __t, _Compare __comp) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2709:1: note: candidate function template not viable: requires 3 arguments, but 2 were provided max(const _Tp& __a, const _Tp& __b, _Compare __comp) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2735:1: note: candidate function template not viable: requires single argument '__t', but 2 arguments were provided max(initializer_list<_Tp> __t) ^ 1 error generated. make: *** [tools/db_bench_tool.o] Error 1 ``` My compiler version: Mac OS X Mojave ``` $ clang++ --version Apple LLVM version 10.0.0 (clang-1000.11.45.5) Target: x86_64-apple-darwin18.2.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4722 Differential Revision: D13220196 Pulled By: sagar0 fbshipit-source-id: 01e5e928288a5613027c83a26ad8aedf04438b14	2018-11-27 13:30:16 -08:00
Po-Chuan Hsieh	60deb4485e	Fix build with ROCKSDB_LITE and -Wunused-private-field (#4715 ) Summary: The error message of databases/rocksdb-lite (FreeBSD port) is as follows: ``` tools/db_bench_tool.cc:1976:16: error: private field 'trace_options_' is not used [-Werror,-Wunused-private-field] TraceOptions trace_options_; ^ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4715 Differential Revision: D13207902 Pulled By: ajkr fbshipit-source-id: be3c612eba656aeddb77e35e2f201dd25dc92f7e	2018-11-26 21:35:38 -08:00
Abhishek Madan	0ed738fdd0	Add max_scan_distance flag to db_bench (#4660 ) Summary: The new flag makes it possible to constrain iterator traversal by the upper/lower bound the iterator is expected to pass. This allows seekrandom results to be more easily comparable between DBs with and without deletions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4660 Differential Revision: D13053111 Pulled By: abhimadan fbshipit-source-id: 33e250f2e2d210b54c7726399da30a33f723c33c	2018-11-14 10:46:12 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Zhongyi Xie	fe0d23059d	Fix two contrun job failures (#4587 ) Summary: Currently there are two contrun test failures: * rocksdb-contrun-lite: > tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char)’: tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’ rocksdb::DumpMallocStats(&stats_string); ^ make: * [tools/db_bench_tool.o] Error 1 * rocksdb-contrun-unity: > In file included from unity.cc:44:0: db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’: db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous auto cmp = ParsedInternalKeyComparator(icmp_); This PR will fix them Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587 Differential Revision: D10846554 Pulled By: miasantreble fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93	2018-10-24 20:16:45 -07:00
Yi Wu	0415244bfa	option to print malloc stats at the end of db_bench (#4582 ) Summary: Option to print malloc stats to stdout at the end of db_bench. This is different from `--dump_malloc_stats`, which periodically print the same information to LOG file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4582 Differential Revision: D10520814 Pulled By: yiwu-arbug fbshipit-source-id: beff5e514e414079d31092b630813f82939ffe5c	2018-10-24 11:39:05 -07:00
Simon Grätzer	f959e88048	Fix printf formatting on MacOS (#4533 ) Summary: On MacOS with clang the compilation of _tools/db_bench_tool.cc_ always fails because the format used in a `fprintf` call has the wrong type. This PR should hopefully fix this issue ``` tools/db_bench_tool.cc:4233:61: error: format specifies type 'unsigned long long' but the argument has type 'size_t' (aka 'unsigned long') ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4533 Differential Revision: D10471657 Pulled By: maysamyabandeh fbshipit-source-id: f20f5f3756d3571b586c895c845d0d4d1e34a398	2018-10-19 14:46:09 -07:00
Abhishek Madan	35cd754a6d	Add writes_before_delete_range flag to db_bench (#4538 ) Summary: The new flag allows tombstones to be generated after enough keys have been written to the database, which makes it easier to ensure that tombstones cover a lot of keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4538 Differential Revision: D10455685 Pulled By: abhimadan fbshipit-source-id: f25d5421745a353c830dea12b79784e852056551	2018-10-18 17:19:59 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as tuning bloom filter Also replaced __thread with thread_local keyword for perf_context Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Abhishek Madan	519f8b145f	Generate appropriate number of keys in db_bench (#4404 ) Summary: If range tombstones are generated every few writes, the KeyGenerator's limit is now extended to account for the additional Next() calls. This is primarily important for `filluniquerandom` benchmarks that enforce the call limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4404 Differential Revision: D9949326 Pulled By: abhimadan fbshipit-source-id: 0bdfeb2cad2098dc0b8b029236dab5e4bef25e38	2018-09-19 16:28:21 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Andrew Kryczka	2c14662213	Revert "Digest ZSTD compression dictionary once per SST file (#4251 )" (#4347 ) Summary: Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting. Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347 Differential Revision: D9668365 Pulled By: ajkr fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd	2018-09-06 09:58:34 -07:00
Andrew Kryczka	6c40806e51	Digest ZSTD compression dictionary once per SST file (#4251 ) Summary: In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file. ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary. There are a couple other changes included in this PR: - Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it. - Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251 Differential Revision: D9257078 Pulled By: ajkr fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438	2018-08-23 19:28:18 -07:00
Fenggang Wu	9d646a6311	Add db_bench options of data block hash index (#4281 ) Summary: Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`. `--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`; `--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281 Differential Revision: D9361476 Pulled By: fgwu fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10	2018-08-16 18:42:46 -07:00
Andrew Kryczka	7a9a164276	Fix db_bench default compression level (#4248 ) Summary: db_bench's previous default compression level (-1) was not the default compression level in all libraries. In particular, in ZSTD negative values are valid compression levels, while ZSTD's default compression level is three. This PR changes db_bench's default to be RocksDB's library-independent default compression level (see #3895). I also changed a couple other flags to get their default values from an options object directly rather than hardcoding. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4248 Differential Revision: D9235140 Pulled By: ajkr fbshipit-source-id: be4e0722d59fa1968832183db36d1d20fcf11e5b	2018-08-09 10:28:14 -07:00
Sagar Vemuri	fefdac1004	Fix lite build failure in db_bench due to trace/replay (#4225 ) Summary: Fix lite build failure in db_bench due to trace/replay feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4225 Differential Revision: D9153303 Pulled By: sagar0 fbshipit-source-id: 9f7a8035429d0dcdbe99616d11389ed7bccf44be	2018-08-03 11:58:55 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Zhichao Cao	6811fb0658	Fixed the db_bench MergeRandom only access CF_default (#4155 ) Summary: When running the tracing and analyzing, I found that MergeRandom benchmark in db_bench only access the default column family even the -num_column_families is specified > 1. changes: Using the db_with_cfh as DB to randomly select the column family to execute the Merge operation if -num_column_families is specified > 1. Tested with make asan_check and verified in tracing Pull Request resolved: https://github.com/facebook/rocksdb/pull/4155 Differential Revision: D8907888 Pulled By: zhichao-cao fbshipit-source-id: 2b4bc8fe0e99c8f262f5be6b986c7025d62cf850	2018-07-20 15:58:54 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Pooja Malik	1857576e03	db_bench support for OPTIONS+bloom and nicer output for perf_context (#4153 ) Summary: Adding the string "PERF_CONTEXT:" before the perf_context stats are printed. Setting the filter policy if it's a block based table even when options are being loaded from the provided FLAGS_options_file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4153 Differential Revision: D8905517 Pulled By: poojam23 fbshipit-source-id: 5956ed7882d39ec8ae654d5dadeb88727a36f0dd	2018-07-18 16:27:49 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Zhongyi Xie	23b76252c8	db_bench: enable setting cache_size when loading options file Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4118 Differential Revision: D8845554 Pulled By: miasantreble fbshipit-source-id: 13bd3c1259a7c30bad762a413fe3bb24eea650ba	2018-07-13 16:43:53 -07:00
Zhongyi Xie	de98fd88e3	Support compaction filter in db_bench (#4106 ) Summary: Right now there is no support for enabling compaction filter in db_bench, we should add support for that to facilitate testing of compaction filter. This PR adds a compaction filter called KeepFilter and make `Filter` always returns false, essentially a noop compaction filter. This will allow us to test compaction filter code path without having to support arbitrary compaction filters Pull Request resolved: https://github.com/facebook/rocksdb/pull/4106 Differential Revision: D8828517 Pulled By: miasantreble fbshipit-source-id: 9ad76d04103eaa9d00da98334b4a39e542d26c41	2018-07-12 19:42:27 -07:00
Andrew Kryczka	97fe23fc5c	Fix unsigned int flag in db_bench (#4129 ) Summary: `DEFINE_uint32` was unavailable on some platforms, e.g., https://travis-ci.org/facebook/rocksdb/jobs/403352902. Use `DEFINE_uint64` instead which should work as it's used many times elsewhere in this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4129 Differential Revision: D8830311 Pulled By: ajkr fbshipit-source-id: b4fc90ba3f50e649c070ce8069c68e530d731f05	2018-07-12 18:43:23 -07:00
Andrew Kryczka	63904434eb	db_bench periodically dump stats to info log (#4109 ) Summary: give control of how often stats are printed, including jemalloc stats if enabled. Previously the default was 10 minutes so we'd only see updated stats for very long benchmark runs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4109 Differential Revision: D8796444 Pulled By: ajkr fbshipit-source-id: fd7902fe3f105fae89322c4ab63316bba4a2b15e	2018-07-12 15:57:42 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Hans-Wilhelm Warlo	4faaab70a6	Benchmark sine wave write rate limit (#3914 ) Summary: As mentioned at the [dev forum.](https://www.facebook.com/groups/rocksdb.dev/1693425187422655/) Let me know if you would like me to do any changes! Closes https://github.com/facebook/rocksdb/pull/3914 Differential Revision: D8452824 Pulled By: siying fbshipit-source-id: 56439b3228ecdcc5a199d5198eff2fab553be961	2018-06-15 12:12:03 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Jacquin Mininger	727eb881a5	Compile error in db bench tool Summary: Small format error below causes build to fail. I believe that this : ``` fprintf(stderr, "num reads to do %lu\n", reads_); ``` Can be changed to this: ``` fprintf(stderr, "num reads to do %" PRIu64 "\n", reads_); ``` Successful build ``` CC utilities/blob_db/blob_dump_tool.o AR librocksdb_debug.a ar: creating archive librocksdb_debug.a /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: librocksdb_debug.a(rocks_lua_compaction_filter.o) has no symbols CC tools/db_bench.o CC tools/db_bench_tool.o tools/db_bench_tool.cc:4532:46: error: format specifies type 'unsigned long' but the argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat] fprintf(stderr, "num reads to do %lu\n", reads_); ~~~ ^~~~~~ %lld 1 error generated. make: *** [tools/db_bench_tool.o] Error 1 ``` ``` $ cd rocksdb $ make all $ g++ --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 9.1.0 (clang-902.0.39.1) Target: x86_64-apple-darwin17.5.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` Closes https://github.com/facebook/rocksdb/pull/3909 Differential Revision: D8215710 Pulled By: siying fbshipit-source-id: 15e49fb02a818fec846e9f9b2a50e372b6b67751	2018-05-30 18:01:36 -07:00
Yi Wu	bc7e8d472e	LRUCache midpoint insertion Summary: Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache. Closes https://github.com/facebook/rocksdb/pull/3877 Differential Revision: D8100895 Pulled By: yiwu-arbug fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da	2018-05-24 15:57:33 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
Zhongyi Xie	a703432808	MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction Summary: `MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used. This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`) Fixes https://github.com/facebook/rocksdb/issues/3229 In master: ``` Level Files Size(MB) -------------------- 0 14 846 1 0 0 2 0 0 3 0 0 4 0 0 5 15 366 6 11 481 Cumulative compaction: 3.83 GB write, 2.27 GB read ``` In branch: ``` Level Files Size(MB) -------------------- 0 9 544 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 445 935 Cumulative compaction: 2.91 GB write, 1.46 GB read ``` db_bench command used: ``` ./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2 ``` Closes https://github.com/facebook/rocksdb/pull/3755 Differential Revision: D7721381 Pulled By: miasantreble fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8	2018-05-03 16:42:13 -07:00
Zhongyi Xie	459bb9028f	remove prefixscanrandom from db_bench help Summary: fix issue reported in https://github.com/facebook/rocksdb/issues/3757 Closes https://github.com/facebook/rocksdb/pull/3784 Differential Revision: D7794107 Pulled By: miasantreble fbshipit-source-id: 43535074fcb82adb5656bcb916284b2dfc5cbb64	2018-04-27 12:13:19 -07:00
Gabriel Wicke	090c78a0d7	Support lowering CPU priority of background threads Summary: Background activities like compaction can negatively affect latency of higher-priority tasks like request processing. To avoid this, rocksdb already lowers the IO priority of background threads on Linux systems. While this takes care of typical IO-bound systems, it does not help much when CPU (temporarily) becomes the bottleneck. This is especially likely when using more expensive compression settings. This patch adds an API to allow for lowering the CPU priority of background threads, modeled on the IO priority API. Benchmarks (see below) show significant latency and throughput improvements when CPU bound. As a result, workloads with some CPU usage bursts should benefit from lower latencies at a given utilization, or should be able to push utilization higher at a given request latency target. A useful side effect is that compaction CPU usage is now easily visible in common tools, allowing for an easier estimation of the contribution of compaction vs. request processing threads. As with IO priority, the implementation is limited to Linux, degrading to a no-op on other systems. Closes https://github.com/facebook/rocksdb/pull/3763 Differential Revision: D7740096 Pulled By: gwicke fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c	2018-04-24 08:41:51 -07:00
Yanqin Jin	2ee1496c43	Add missing whitespace. Summary: Closes https://github.com/facebook/rocksdb/pull/3729 Differential Revision: D7645465 Pulled By: riversand963 fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d	2018-04-17 09:57:40 -07:00
Yi Wu	2c2f388897	db_bench fillXXXdeterministic should respect compression type Summary: db_bench fillXXXdeterministic should respect compression type when calling CompactFiles(). Closes https://github.com/facebook/rocksdb/pull/3731 Differential Revision: D7647761 Pulled By: yiwu-arbug fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6	2018-04-16 18:01:47 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Yi Wu	b864bc9b5b	Blob DB: Improve FIFO eviction Summary: Improving blob db FIFO eviction with the following changes, * Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size. * FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion. * FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files. * If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO. * Compaction filter also filter those blob indexes where corresponding blob file is gone. * Add an event listener to listen compaction/flush event and update sst file size. * Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db. * More blob db statistics around FIFO. * Fix some locking issue when accessing a blob file. Closes https://github.com/facebook/rocksdb/pull/3556 Differential Revision: D7139328 Pulled By: yiwu-arbug fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa	2018-03-06 11:57:42 -08:00
Pooya Shareghi	0a2354ca8f	Added bytes XOR merge operator Summary: Closes https://github.com/facebook/rocksdb/pull/575 I fixed the merge conflicts etc. Closes https://github.com/facebook/rocksdb/pull/3065 Differential Revision: D7128233 Pulled By: sagar0 fbshipit-source-id: 2c23a48c9f0432c290b0cd16a12fb691bb37820c	2018-03-06 10:27:36 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Andrew Kryczka	0a0fad447b	db_bench separate options for partition index and filters Summary: Some workloads (like my current benchmarking) may want partitioned indexes without partitioned filters. Particularly, when `-optimize_filters_for_hits=true`, the total index size may be larger than the total filter size, so it can make sense to hold all filters in-memory but not all indexes. Closes https://github.com/facebook/rocksdb/pull/3492 Differential Revision: D6970092 Pulled By: ajkr fbshipit-source-id: b7fa1828e1d13829339aefb90fd56eb7c5337f61	2018-02-12 14:57:13 -08:00
Tamir Duberstein	cd5092e168	Suppress unused warnings Summary: - Use `__unused__` everywhere - Suppress unused warnings in Release mode + This currently affects non-MSVC builds (e.g. mingw64). Closes https://github.com/facebook/rocksdb/pull/3448 Differential Revision: D6885496 Pulled By: miasantreble fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211	2018-02-02 12:27:07 -08:00
Siying Dong	e2d4b0efb1	db_bench: sanity check CuckooTable with mmap_read option Summary: This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified. Closes https://github.com/facebook/rocksdb/pull/3420 Differential Revision: D6838284 Pulled By: siying fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341	2018-01-29 14:27:32 -08:00
Andrew Kryczka	9f7ccc8445	fix db_bench filluniquerandom key count assertion Summary: It failed every time. I guess people usually ran with assertions disabled. Closes https://github.com/facebook/rocksdb/pull/3422 Differential Revision: D6822984 Pulled By: ajkr fbshipit-source-id: 2e90db75618b26ac1c46ddfa9e03c095c7bf16e3	2018-01-29 11:43:21 -08:00
Andrew Kryczka	0e6e405fec	db_bench support for memtable in-place update Summary: Closes https://github.com/facebook/rocksdb/pull/3416 Differential Revision: D6820606 Pulled By: ajkr fbshipit-source-id: 5035ffb33ade8d50520cafeb685ee8c8fcf1cca8	2018-01-26 10:57:49 -08:00
Anand Ananthabhotla	199405192d	Add a BlockBasedTableOption to turn off index block compression. Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes https://github.com/facebook/rocksdb/pull/3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054	2018-01-10 15:11:59 -08:00
Yi Wu	46ec52499e	Fix db_bench write being disabled in lite build Summary: The macro was added by mistake in #2372 Closes https://github.com/facebook/rocksdb/pull/3343 Differential Revision: D6681356 Pulled By: yiwu-arbug fbshipit-source-id: 4180172fb0eaef4189c07f219241e0c261c03461	2018-01-09 10:57:29 -08:00
yingsu00	f54d7f5fea	Port 3 way SSE4.2 crc32c implementation from Folly Summary: # Summary RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain. This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used. # Performance Test Results We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm. Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s. 1) ReadRandom in db_bench overall metrics PER RUN Algorithm \| run \| micros/op \| ops/sec \|Throughput (MB/s) 3-way \| 1 \| 4.143 \| 241387 \| 26.7 3-way \| 2 \| 3.775 \| 264872 \| 29.3 3-way \| 3 \| 4.116 \| 242929 \| 26.9 FastCrc32c\|1 \| 4.037 \| 247727 \| 27.4 FastCrc32c\|2 \| 4.648 \| 215166 \| 23.8 FastCrc32c\|3 \| 4.352 \| 229799 \| 25.4 AVG Algorithm \| Average of micros/op \| Average of ops/sec \| Average of Throughput (MB/s) 3-way \| 4.01 \| 249,729 \| 27.63 FastCrc32c \| 4.35 \| 230,897 \| 25.53 2) Crc32c computation CPU cost (inclusive samples percentage) PER RUN Implementation \| run \| TotalSamples \| Crc32c percentage 3-way \| 1 \| 4,572,250,000 \| 4.37% 3-way \| 2 \| 3,779,250,000 \| 4.62% 3-way \| 3 \| 4,129,500,000 \| 4.48% FastCrc32c \| 1 \| 4,663,500,000 \| 11.24% FastCrc32c \| 2 \| 4,047,500,000 \| 12.34% FastCrc32c \| 3 \| 4,366,750,000 \| 11.68% # Test Plan make -j64 corruption_test && ./corruption_test By default it uses 3-way SSE algorithm NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test make clean && DEBUG_LEVEL=0 make -j64 db_bench make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench Closes https://github.com/facebook/rocksdb/pull/3173 Differential Revision: D6330882 Pulled By: yingsu00 fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9	2017-12-19 18:26:49 -08:00
Sagar Vemuri	d51fcb21f4	Blob DB: Add db_bench options Summary: Adding more BlobDB db_bench options which are needed for benchmarking. Closes https://github.com/facebook/rocksdb/pull/3230 Differential Revision: D6500711 Pulled By: sagar0 fbshipit-source-id: 91d63122905854ef7c9148a0235568719146e6c5	2017-12-06 20:44:12 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Sagar Vemuri	8954f830a0	Blob DB: db_bench flag to control BlobDB's garbage collection Summary: flag: blob_db_enable_gc, to control BlobDb's enable_garbage_collection. Closes https://github.com/facebook/rocksdb/pull/3190 Differential Revision: D6383395 Pulled By: sagar0 fbshipit-source-id: 4134e835150748c425b8187264273a54c6d8381c	2017-11-20 23:26:15 -08:00
Andrew Kryczka	114896c4e0	db_bench compression options Summary: - moved existing compression options to `InitializeOptionsGeneral` since they cannot be set through options file - added flag for `zstd_max_train_bytes` which was recently introduced by #3057 Closes https://github.com/facebook/rocksdb/pull/3128 Differential Revision: D6240460 Pulled By: ajkr fbshipit-source-id: 27dbebd86a55de237ba6a45cc79cff9214e82ebc	2017-11-07 14:00:03 -08:00
Andrew Kryczka	65c95d9c59	support db_bench compact benchmark on bottommost files Summary: Without this option, running the compact benchmark on a DB containing only bottommost files simply returned immediately. Closes https://github.com/facebook/rocksdb/pull/3138 Differential Revision: D6256660 Pulled By: ajkr fbshipit-source-id: e3b64543acd503d821066f4200daa201d4fb3a9d	2017-11-07 10:57:24 -08:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Andrew Kryczka	731895214b	db_bench randomtransaction print throughput Summary: print throughput in MB/s upon finishing randomtransaction benchmark Closes https://github.com/facebook/rocksdb/pull/3016 Differential Revision: D6070426 Pulled By: ajkr fbshipit-source-id: 69df43beed4c374a36d826e761ca3a83e1fdcbf5	2017-10-16 18:42:25 -07:00
Andrew Kryczka	1026e794a3	rate limit auto-tuning Summary: Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes: - make rate limiter's `Env*` configurable for testing - track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter. Closes https://github.com/facebook/rocksdb/pull/2899 Differential Revision: D5858704 Pulled By: ajkr fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1	2017-10-04 19:15:01 -07:00
Andrew Kryczka	8fc3de3c62	make rate limiter a general option Summary: it's unsupported in options file, so the flag should be respected by db_bench even when an options file is provided. Closes https://github.com/facebook/rocksdb/pull/2910 Differential Revision: D5869836 Pulled By: ajkr fbshipit-source-id: f67f591ae083e95e989f86b6fad50765d2e3d855	2017-09-21 11:11:00 -07:00
Andrew Kryczka	74f18c1301	db_bench support for non-uniform column family ops Summary: Previously we could only select the CF on which to operate uniformly at random. This is a limitation, e.g., when testing universal compaction as all CFs would need to run full compaction at roughly the same time, which isn't realistic. This PR allows the user to specify the probability distribution for selecting CFs via the `--column_family_distribution` argument. Closes https://github.com/facebook/rocksdb/pull/2677 Differential Revision: D5544436 Pulled By: ajkr fbshipit-source-id: 478d56260995236ae90895ce5bd51f38882e185a	2017-08-11 13:57:17 -07:00
Andrew Kryczka	dce6d5a838	db_bench background work thread pool size arguments Summary: The background thread pools' sizes weren't easily configurable by `max_background_compactions` and `max_background_flushes` in multi-instance setups. Introduced separate arguments for their sizes. Closes https://github.com/facebook/rocksdb/pull/2680 Differential Revision: D5550675 Pulled By: ajkr fbshipit-source-id: bab5f0a7bc5db63bb084d0c10facbe437096367d	2017-08-03 21:41:49 -07:00
Andrew Kryczka	cc01985db0	Introduce bottom-pri thread pool for large universal compactions Summary: When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs. This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions. - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads. - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default. - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic. - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`). - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free. Closes https://github.com/facebook/rocksdb/pull/2580 Differential Revision: D5422916 Pulled By: ajkr fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333	2017-08-03 15:43:29 -07:00
Andrew Kryczka	060ccd4f84	support multiple CFs with OPTIONS file Summary: Move an option necessary for running db_bench on multiple CFs into the general initialization area, so it works with both flag-based init and OPTIONS-based init. Closes https://github.com/facebook/rocksdb/pull/2675 Differential Revision: D5541378 Pulled By: ajkr fbshipit-source-id: 169926cb4ae95c17974f744faf7cc794d41e5c0a	2017-08-02 16:27:01 -07:00
Yi Wu	1900771bd2	Dump Blob DB options to info log Summary: * Dump blob db options to info log * Remove BlobDBOptionsImpl to disallow dynamic cast BlobDBOptions into BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645 * Change some of the default options * Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon. Closes https://github.com/facebook/rocksdb/pull/2671 Differential Revision: D5529912 Pulled By: yiwu-arbug fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7	2017-08-01 13:01:47 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Andrew Kryczka	f33f113683	fix db_bench argument type Summary: it should be a bool Closes https://github.com/facebook/rocksdb/pull/2653 Differential Revision: D5506148 Pulled By: ajkr fbshipit-source-id: f142f0f3aa8b678c68adef12e5ac6e1e163306f3	2017-07-27 12:14:37 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00

1 2 3 4 5 ...

344 Commits